In the Spark shell, I ran the following:
val d = sc.parallelize(0 until 1000000).map(i => (i % 100000, i)).persist
d.join(d.reduceByKey(_ + _)).collect
The Spark UI shows three stages. Stages 4 and 5 correspond to the computation of d, and stage 6 corresponds to the computation of the action collect. Since d is persisted, I would expect only two stages. However, stage 5 is present and is not connected to any other stage.

So I tried the same computation without persist, and the DAG looks identical, except without the green dots indicating that the RDD has been persisted.

I would expect the output of stage 11 to be connected to the input of stage 12, but it is not.
Looking at the stage descriptions, they do seem to indicate that d was persisted, because stage 5 has input, but I'm still confused as to why stage 5 exists at all.
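One way I tried to confirm whether d is actually cached is to inspect the lineage with RDD.toDebugString. This is a sketch assuming a live SparkContext `sc` in the spark-shell; the cached-partition annotations only appear after d has been materialized at least once:

```scala
val d = sc.parallelize(0 until 1000000).map(i => (i % 100000, i)).persist

// Force materialization so the cache is actually populated.
d.count()

// Print the lineage of the joined RDD; once d is in memory, its
// partitions are reported as "CachedPartitions" in the output.
val joined = d.join(d.reduceByKey(_ + _))
println(joined.toDebugString)
```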
