Do stages in an application run in parallel in Spark?

Question

Do stages in an application run in parallel in Spark?

I have a question about how stages are executed in a Spark application. Is there an execution order for stages that the programmer can define, or is it derived by the Spark engine?

+5

scala bigdata apache-spark

A srinivas Dec 27 '16 at 7:08

2 answers

How stages work in Spark

Stages of a job can run in parallel if there are no dependencies between them.

In Spark, stages are split at boundaries. There is a shuffle stage, which is a boundary stage where the chain of transformations is split (e.g. by reduceByKey), and there is a result stage, which yields a result without causing a shuffle (e.g. a map operation).

(Image courtesy of Cloudera)

Since groupByKey causes a shuffle, you can see a split in the pink boxes, which marks a stage boundary.
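A minimal sketch of how such a boundary can be observed, assuming Spark is on the classpath and runs in local mode. The object name `StageBoundaryDemo`, the method `wordCounts`, and the sample data are made up for illustration. `toDebugString` prints the RDD lineage, and the indentation step in its output should correspond to the shuffle introduced by `reduceByKey`:

```scala
import org.apache.spark.sql.SparkSession

object StageBoundaryDemo {
  // Hypothetical helper: counts words, dropping "c", and prints the lineage.
  def wordCounts(spark: SparkSession): Map[String, Int] = {
    val sc = spark.sparkContext
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2)

    // Narrow transformations: no shuffle, so they stay in the same stage.
    val pairs = words.map(w => (w, 1)).filter { case (w, _) => w != "c" }

    // Wide transformation: needs a shuffle, so a new stage starts here.
    val counts = pairs.reduceByKey(_ + _)

    // Lineage of the final RDD; shuffle dependencies show up as indentation.
    println(counts.toDebugString)

    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-boundary-demo")
      .master("local[2]") // assumption: local run with two worker threads
      .getOrCreate()
    try println(wordCounts(spark))
    finally spark.stop()
  }
}
```

The same boundary should also be visible as two separate stages for this job in the Spark web UI.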

Internally, a stage is further divided into tasks. For example, in the picture above, the first row, textFile -> map -> filter, forms a single stage. Its transformations are pipelined, and Spark runs one task per partition, with each task applying all three steps to its partition.

When the output of one transformation is the input of another, they have to execute serially. But if stages are unrelated to each other, e.g. hadoopFile -> groupByKey -> map, they can run in parallel. Once a dependency is declared between them, execution continues serially from that stage onward.
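As a rough illustration, the sketch below builds two branches that do not depend on each other and then joins them. It assumes Spark on the classpath in local mode; `IndependentStagesDemo`, `totals`, and the sample data are hypothetical. Each branch needs its own shuffle, so the scheduler is free to run their map stages concurrently, while the stage that performs the join has to wait for both:

```scala
import org.apache.spark.sql.SparkSession

object IndependentStagesDemo {
  // Hypothetical helper: joins per-key click counts with per-key view sums.
  def totals(spark: SparkSession): Map[String, (Int, Int)] = {
    val sc = spark.sparkContext

    // Branch 1: its own shuffle (reduceByKey), independent of branch 2.
    val clicks = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1), 2)
      .reduceByKey(_ + _)

    // Branch 2: its own shuffle (groupByKey), independent of branch 1.
    val views = sc.parallelize(Seq("a" -> 10, "b" -> 20), 2)
      .groupByKey()
      .mapValues(_.sum)

    // The join depends on both branches, so it runs only after both finish.
    val joined = clicks.join(views)
    println(joined.toDebugString)

    joined.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("independent-stages-demo")
      .master("local[2]") // assumption: local run with two worker threads
      .getOrCreate()
    try println(totals(spark))
    finally spark.stop()
  }
}
```

The programmer shapes this order only indirectly, through the dependencies between RDDs. The engine derives the actual stage schedule from them.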

+3

Yuval Itzchakov Dec 27 '16 at 7:18

mrsrinivas · Accepted Answer · 2016-12-27T07:25:50+0000

Check the entities (stages, partitions) in this picture:

pic credits

Do stages in a job (Spark application?) run in parallel in Spark?

Yes, they can run in parallel if there is no sequential dependency between them.

Here, the partitions of Stage 1 and Stage 2 can be executed in parallel, but the partitions of Stage 0 cannot, because the partitions they depend on in Stages 1 and 2 have to be processed first.
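The ordering described here can be sketched with a toy model in plain Scala. This is not Spark's actual scheduler, only an illustration of the dependency rule. The object `StageWaves` and its `waves` function are hypothetical names. Stages whose parent stages have all finished are grouped into one "wave" that may run concurrently:

```scala
object StageWaves {
  // deps(stage) = parent stages that must finish before `stage` can start.
  def waves(deps: Map[Int, Set[Int]]): List[Set[Int]] = {
    @annotation.tailrec
    def loop(remaining: Map[Int, Set[Int]],
             done: Set[Int],
             acc: List[Set[Int]]): List[Set[Int]] =
      if (remaining.isEmpty) acc.reverse
      else {
        // A stage is ready once all of its parents are done.
        val ready = remaining.collect {
          case (stage, parents) if parents.subsetOf(done) => stage
        }.toSet
        require(ready.nonEmpty, "cyclic or missing dependency")
        loop(remaining -- ready, done ++ ready, ready :: acc)
      }
    loop(deps, Set.empty, Nil)
  }

  def main(args: Array[String]): Unit = {
    // Stage 0 depends on Stages 1 and 2, which are independent of each other.
    val deps = Map(0 -> Set(1, 2), 1 -> Set.empty[Int], 2 -> Set.empty[Int])
    println(waves(deps))
  }
}
```

With the dependencies from the picture, Stages 1 and 2 land in the first wave and Stage 0 in the second.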

Is there an execution order for stages that the programmer can define, or is it derived by the Spark engine?

A stage boundary is defined wherever data is shuffled among partitions (check the pink lines in the picture).
