Multiple windows of varying lengths in the Spark Streaming app

I would like to process a real-time data stream (from Kafka) using Spark Streaming. I need to compute various statistics over the input stream, and each statistic needs to be computed over a window of a different length. For example, I might need to calculate the average value of stat 'A' over the last 5 minutes, and at the same time the median of stat 'B' over the last hour.

In this case, what is the recommended approach to using Spark Streaming? Below are a few options I could think of:

(i) Have one DStream from Kafka and derive several DStreams from it using the window() method, setting a different windowDuration (and slideDuration) for each of the resulting DStreams. E.g.:

// pseudo-code
import org.apache.spark.streaming.{Hours, Minutes}

val streamA = kafkaDStream.window(Minutes(5), Minutes(1))   // 5-minute window, sliding every minute
val streamB = kafkaDStream.window(Hours(1), Minutes(10))    // 1-hour window, sliding every 10 minutes
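
To make the intent concrete, here is a rough sketch of how the two stats might be computed on top of these windows. The (statName, value) record shape and the parseRecord helper are illustrative assumptions, not actual code:

import org.apache.spark.streaming.dstream.DStream

// Average of stat 'A' over the last 5 minutes, updated every minute.
// parseRecord (hypothetical) turns a raw Kafka record into (statName, value).
val avgA: DStream[Double] = streamA
  .map(parseRecord)
  .filter(_._1 == "A")
  .map { case (_, v) => (v, 1L) }                            // (value, count)
  .reduce { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  .map { case (sum, n) => sum / n }

// Median of stat 'B' over the last hour, updated every 10 minutes.
// An exact median needs a per-window sort; an approximate quantile would
// be cheaper at scale.
val medianB: DStream[Double] = streamB
  .map(parseRecord)
  .filter(_._1 == "B")
  .map(_._2)
  .transform { rdd =>
    val n = rdd.count() // runs on the driver at job-generation time
    if (n == 0) rdd.sparkContext.emptyRDD[Double]
    else {
      val indexed = rdd.sortBy(identity).zipWithIndex().map(_.swap)
      rdd.sparkContext.parallelize(indexed.lookup(n / 2))
    }
  }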

(ii) Launch individual Spark Streaming applications - one for each stat

Questions

Option (i) seems like the more efficient approach to me. However, I have a few doubts about it:

  • How will streamA and streamB be represented internally, i.e. what underlying data structure backs them?
  • Will they share data, since both are derived from kafkaDStream, or will the data be duplicated?
  • Also, is there an even more efficient way to handle this use case?

Thanks in advance

1 answer

Option (i) looks like the way to go. Each call to window() creates a WindowedDStream, which is essentially a view over its parent: every windowed RDD is built as a union of the batch RDDs that the parent DStream has already generated, so streamA and streamB share the underlying data rather than copying it.

Launching separate applications instead (option (ii), two applications here) would mean that each one consumes the Kafka topic independently, so the input data would effectively be duplicated (and so would the receiving and processing resources).

Also note that the two streams will produce output at different rates, driven by their slide intervals: streamA emits a new result every minute, covering the last 5 minutes of data, while streamB emits a result every 10 minutes, covering the last hour.
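
To make the cadence visible, you could attach a trivial output to each stream (a sketch; what you print is arbitrary):

// streamA fires every minute, streamB every 10 minutes, regardless of
// the batch interval.
streamA.foreachRDD { (rdd, time) => println(s"[$time] streamA: ${rdd.count()} records") }
streamB.foreachRDD { (rdd, time) => println(s"[$time] streamB: ${rdd.count()} records") }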

Memory consumption differs as well: assuming a 1-minute batch interval, streamA only needs the last 5 batches to be retained, whereas streamB, because of window(Hours(1)), forces the last 60 batches to be kept around (60 vs 5).
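
One related knob (an addition on my part, not something you asked about): a WindowedDStream already persists its parent's RDDs with a serialized storage level by default, but you can set the level explicitly on the shared parent if you want to control the memory/CPU trade-off:

import org.apache.spark.storage.StorageLevel

// Cache the shared parent in serialized form; both windowed views reuse
// these cached batch RDDs instead of recomputing them.
kafkaDStream.persist(StorageLevel.MEMORY_ONLY_SER)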


Source: https://habr.com/ru/post/1598964/

