Parallel operations on RDDs in foreachRDD functions of a Spark DStream

In the following code, the fn1 and fn2 functions turn out to be applied to inRDD sequentially, as I can see in the Stages section of the Spark Web UI.

 DstreamRDD1.foreachRDD(new VoidFunction<JavaRDD<String>>() {
     public void call(JavaRDD<String> inRDD) {
         inRDD.foreach(fn1);
         inRDD.foreach(fn2);
     }
 });

How is it different when the streaming job is written this way instead? Do the following functions run in parallel on their input DStreams?

 DStreamRDD1.foreachRDD(fn1);
 DStreamRDD2.foreachRDD(fn2);
1 answer

Both foreach on an RDD and foreachRDD on a DStream are output operations, meaning they force the materialization of the lineage graph, and output operations are scheduled one after the other. This is not the case for ordinary lazy transformations in Spark, which can run in parallel when the execution graph diverges into several separate stages.

For instance:

 val dStream: DStream[String] = ???
 val first = dStream.filter(x => x.contains("h"))
 val second = dStream.filter(x => !x.contains("h"))
 first.print()
 second.print()

The two filter branches need not execute sequentially, assuming you have enough cluster resources for the underlying stages to run in parallel. Calling print, which is again an output operation, will then cause the print statements to produce their output one after the other.
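If you really need the two blocking actions inside a single foreachRDD to overlap, one common pattern is to submit them from separate threads and wait for both. The following is a minimal plain-Java sketch of that threading pattern (no Spark involved; ParallelActions and action are hypothetical stand-ins for blocking calls such as inRDD.foreach(fn1) and inRDD.foreach(fn2)):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelActions {
    // Thread-safe sink so both tasks can record completion concurrently.
    static final List<Integer> results =
            Collections.synchronizedList(new ArrayList<>());

    // Stand-in for a blocking "output operation" (e.g. inRDD.foreach(fnN)).
    static void action(int id) {
        results.add(id);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Submit both actions; they may now be scheduled concurrently
        // instead of one blocking the other.
        Future<?> f1 = pool.submit(() -> action(1));
        Future<?> f2 = pool.submit(() -> action(2));
        // Block until both complete, as you would before leaving foreachRDD.
        f1.get();
        f2.get();
        pool.shutdown();
        System.out.println(results.size());
    }
}
```

Inside a real foreachRDD you would submit the two RDD actions to such a pool and join both futures before the call method returns, so the batch is not marked complete early.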


Source: https://habr.com/ru/post/1260103/

