Spark Structured Streaming: Multiple Sinks

1) We consume from Kafka using Structured Streaming and write the processed dataset to S3. Going forward, we also want to write the processed data to Kafka. Is it possible to do both from the same streaming query? (Spark version 2.1.1)

2) In the logs I see the progress of the streaming query, and I have a durationMs sample from the log below. Can someone please explain the difference between addBatch and getBatch?

3) Does triggerExecution include the time to process the fetched data and write it to the sink?

"durationMs" : { "addBatch" : 2263426, "getBatch" : 12, "getOffset" : 273, "queryPlanning" : 13, "triggerExecution" : 2264288, "walCommit" : 552 }, 

Regards, aravias

1 answer

1) Yes.

In Spark 2.1.1, you can use writeStream.foreach to write your data to Kafka. There is an example on this blog: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
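A minimal sketch of that approach, assuming String-serialized records; the broker address, topic name, bucket paths, and the processed Dataset are all placeholders. Note that each sink gets its own query (and its own checkpoint location) over the same processed data:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

// ForeachWriter that opens one Kafka producer per partition/epoch
// and sends each row as a comma-joined string value.
val kafkaWriter = new ForeachWriter[Row] {
  var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092") // placeholder broker
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  def process(row: Row): Unit =
    producer.send(
      new ProducerRecord[String, String]("output-topic", row.mkString(",")))

  def close(errorOrNull: Throwable): Unit = producer.close()
}

// Two independent queries over the same processed Dataset:
// one writing to S3 as Parquet, one writing to Kafka.
val toS3 = processed.writeStream
  .format("parquet")
  .option("path", "s3a://bucket/output")              // placeholder path
  .option("checkpointLocation", "s3a://bucket/cp-s3") // placeholder path
  .start()

val toKafka = processed.writeStream
  .foreach(kafkaWriter)
  .option("checkpointLocation", "s3a://bucket/cp-kafka") // placeholder path
  .start()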

Or you can upgrade to Spark 2.2.0, which adds a built-in Kafka sink that officially supports writing to Kafka.
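With 2.2.0+ the built-in sink replaces the foreach workaround; a sketch, again with placeholder broker, topic, and paths:

// Spark 2.2.0+: the built-in Kafka sink. It expects a "value" column
// (and optionally "key"); "processed" is the same placeholder Dataset.
processed
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "s3a://bucket/cp-kafka")
  .start()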

2) getBatch measures how long it takes to create a DataFrame from the source. This is usually pretty fast. addBatch measures how long it takes to run that DataFrame in the sink.

3) triggerExecution measures how long it takes to run one trigger, which is usually almost the same as getOffset + getBatch + addBatch.
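You can check this against your sample: getOffset (273) + getBatch (12) + queryPlanning (13) + walCommit (552) + addBatch (2263426) comes to 2264276 ms, within a few milliseconds of the reported triggerExecution of 2264288 ms. Virtually all of the trigger time is spent in addBatch, i.e. processing the batch and writing it to the sink.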

