How does Spark split work into stages?
Stages of a job can be executed in parallel if there are no dependencies between them.
In Spark, stages are split at shuffle boundaries. There are two kinds of stages: shuffle map stages, which end at a transformation that produces a shuffle, e.g. reduceByKey, and result stages, which compute the final result of an action without causing a further shuffle, e.g. a map operation followed by the action.
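As a minimal sketch of where that boundary falls (assuming a spark-shell session, where sc is the predefined SparkContext, and a hypothetical input file input.txt):

```scala
// Everything before reduceByKey is a narrow transformation and is
// pipelined into a single shuffle map stage; the count() is computed
// by a result stage that reads the shuffled output.
val counts = sc.textFile("input.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))             // narrow: pipelined
  .map(word => (word, 1))               // narrow: same stage
  .reduceByKey(_ + _)                   // wide: shuffle => stage boundary

println(counts.toDebugString)           // lineage printout; the indentation
                                        // shift marks the shuffle boundary
counts.count()                          // triggers a job with two stages
```

toDebugString prints the RDD lineage; a new indentation level appears exactly where a shuffle dependency splits the job into stages.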

(Image courtesy of Cloudera)
Since groupByKey triggers a shuffle, you can see the split between the pink boxes that marks the stage boundary.
Internally, a stage is further divided into tasks. For example, in the figure above, the first row doing textFile -> map -> filter runs as one task per partition of the data, with the three transformations pipelined together inside each task rather than split into separate tasks.
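A short illustration of that pipelining (again in spark-shell, with a hypothetical data.txt; the exact partition count depends on the input splits):

```scala
// Narrow transformations stay inside the same task: each task reads one
// partition and applies map and filter to it in a single pass.
val rdd = sc.textFile("data.txt", minPartitions = 3)
  .map(_.toLowerCase)   // narrow: runs in the same task as the read
  .filter(_.nonEmpty)   // narrow: still the same task

println(rdd.getNumPartitions) // number of tasks this stage will run
rdd.count()                   // one stage, one task per partition
```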
When the output of one transformation is the input of another, the stages must execute sequentially. But stages that do not depend on each other, e.g. the hadoopFile -> groupByKey -> map branch and the textFile -> map -> filter branch, can run in parallel. Once a dependency between them is declared (here, where the two branches join), execution from that stage onward proceeds in series.
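A sketch of two independent lineages that meet only at a join (hypothetical file names; only the dependency structure matters):

```scala
// Until the join, the two branches share no dependency, so the scheduler
// is free to run their stages in parallel. The join declares a
// dependency, and the stages from that point on execute serially.
val left = sc.textFile("events.txt")            // branch 1
  .map(line => (line.split(",")(0), line))
  .filter { case (key, _) => key.nonEmpty }

val right = sc.textFile("users.txt")            // branch 2, independent of branch 1
  .map(line => (line.split(",")(0), line))
  .groupByKey()                                  // shuffle inside branch 2 only

val joined = left.join(right)                    // dependency declared here
joined.count()                                   // downstream stages run in series
```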