Does the Shuffle step in MapReduce run in parallel with Mapping?

I was trying to understand a MapReduce program. While doing so, I noticed that the reduce tasks start executing almost immediately after all the map tasks have finished. This is surprising, because the reduce tasks work with data that is grouped by key, which means a shuffle/sort step runs in between. The only way this could happen is if the shuffling is done in parallel with the mapping.

Secondly, if shuffling is indeed done in parallel with mapping, what is the equivalent in Apache Spark? Can mapping and grouping by key and/or sorting happen in parallel there too?

1 answer

Hadoop MapReduce is not just map and reduce stages. There are additional steps, such as combiners (map-side reduce) and merge, as illustrated in the diagram at http://www.bodhtree.com/blog/2012/10/18/ever-wondered-what-happens-between-map-and-reduce/. While the maps are still running, the keys they emit can be routed and merged, so by the time the maps finish, all the information needed for some reduce buckets may already be processed and ready for the reduce.
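The overlap between mapping and routing can be sketched in plain Python. This is a toy single-process model of a word count, not Hadoop's actual implementation; the function names and the `partition` rule are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner (map-side reduce): pre-aggregate one map task's output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def partition(key, num_reducers):
    """Route a key to a reducer bucket (hypothetical rule for this sketch)."""
    return sum(key.encode("utf-8")) % num_reducers

def run_job(splits, num_reducers=2):
    # One bucket per reducer; each bucket collects partial values per key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    # Each iteration stands in for one finished map task. Its combined output
    # is routed into the reducer buckets right away, before later map tasks run.
    for split in splits:
        pairs = [pair for line in split for pair in map_words(line)]
        for key, partial in combine(pairs).items():
            buckets[partition(key, num_reducers)][key].append(partial)
    # Reduce step: by now every bucket is already grouped by key.
    result = {}
    for bucket in buckets:
        for key in sorted(bucket):
            result[key] = sum(bucket[key])
    return result

# Two input splits, each handled by its own (simulated) map task.
splits = [["a b a", "c a"], ["b b", "c"]]
counts = run_job(splits)
```

Because the grouping work happens as each map task finishes, the final loop only has to sum values that are already sitting in the right bucket, which is why the reduce can appear to start almost immediately.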

Spark builds a DAG (directed acyclic graph) of the phases needed for processing and groups them into stages, with a stage boundary wherever data must be shuffled between nodes. Unlike Hadoop, where map output starts moving to the reducers while the maps are still running, Spark reducers pull the data and thus only do so when they start running. (On the other hand, Spark tries to run more in memory rather than on disk, and because it works with a DAG, it handles iterative processing better.)
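The idea of cutting a job into stages at shuffle boundaries can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: it treats the job as a linear chain of operation names, whereas Spark's actual scheduler works on a graph of RDD dependencies. The `SHUFFLE_OPS` set and `split_into_stages` function are hypothetical helpers, not part of Spark.

```python
# Operations assumed, for this sketch, to require a shuffle between nodes.
SHUFFLE_OPS = {"groupByKey", "reduceByKey", "sortByKey", "join"}

def split_into_stages(ops):
    """Cut a linear chain of operations into stages at each shuffle."""
    stages = [[]]
    for op in ops:
        if op in SHUFFLE_OPS and stages[-1]:
            # A shuffle operation reads shuffled data, so it opens a new stage.
            stages.append([])
        stages[-1].append(op)
    return stages

pipeline = ["textFile", "flatMap", "map", "reduceByKey",
            "filter", "sortByKey", "collect"]
stages = split_into_stages(pipeline)
for number, stage in enumerate(stages):
    print(number, stage)
```

In this model, everything inside one stage can run back to back on the same partition, while each new stage has to wait for the shuffled output of the stage before it.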

Alexey Grishchenko has a good explanation of the Spark shuffle (note that as of Spark 2, only the sort shuffle exists).


Source: https://habr.com/ru/post/1266332/

