In which part / class of MapReduce is the logic that stops reduce tasks from starting?

In Hadoop MapReduce, no reducer starts reducing before all the mappers have finished. Can someone please explain in which part / class / piece of code this logic is implemented? I'm talking about Hadoop MapReduce version 1 (NOT YARN). I have searched the MapReduce source code, but there are so many classes, and I do not really understand the method calls and their order.

In other words, I need (at first, for testing) the reducers to start reducing even if there are still mappers running. I know that this way I will get wrong results for the job, but this is the start of some work on changing parts of the framework. So where should I start looking and making changes?

+1
2 answers

This is done in the shuffle phase. For Hadoop 1.x, take a look at org.apache.hadoop.mapred.ReduceTask.ReduceCopier, which implements ShuffleConsumerPlugin. You can also read the research paper "Breaking the MapReduce Stage Barrier" by Verma et al.

EDIT:

After reading @chris-white's answer, I realized that my answer needs further explanation. In the MapReduce model, you need to wait for all mappers to finish, since the keys need to be grouped and sorted; plus, you may have speculative mappers running, and you don't yet know which of the duplicate mappers will finish first. However, as the "Breaking the MapReduce Stage Barrier" paper points out, for some applications it may make sense not to wait for all of the map output. If you want to implement this behavior (most likely for research purposes), then you should take a look at the classes I mentioned above.

+3

Some points to clarify:

A reducer cannot start reducing until all the mappers have finished and their partitions have been copied to the node where the reducer task is running and, finally, sorted.

What you may see is a reducer pre-emptively copying map output while other map tasks are still running. This is controlled by a configuration property known as slowstart (mapred.reduce.slowstart.completed.maps). Its value is a ratio (0.0 - 1.0) of the number of map tasks that must complete before the reducer tasks are started (and begin copying the map outputs from the map tasks that have already completed). The stock default in Hadoop 1.x is 0.05, although clusters are often configured with a higher value; with 0.9, for example, and 100 map tasks in your job, 90 of them need to finish before the job tracker starts launching the reduce tasks.
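To make the arithmetic concrete, here is a minimal, self-contained sketch of how a slowstart ratio translates into a number of completed map tasks. It is only modeled on the description above and is not the actual JobInProgress code; the class name SlowstartSketch and the methods mapsRequired and mayScheduleReduces are hypothetical names made up for this illustration, and rounding up with Math.ceil is an assumption.

```java
public class SlowstartSketch {
    // Real Hadoop 1.x property name; the constant itself is just for this sketch.
    static final String SLOWSTART_KEY = "mapred.reduce.slowstart.completed.maps";

    /**
     * Number of map tasks that must have completed before reduce tasks
     * may be scheduled, for a given slowstart ratio (hypothetical helper).
     */
    static int mapsRequired(double slowstartRatio, int numMapTasks) {
        if (slowstartRatio < 0.0 || slowstartRatio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0.0, 1.0]");
        }
        // Assumption: a fractional result is rounded up to a whole task.
        return (int) Math.ceil(slowstartRatio * numMapTasks);
    }

    /** True once enough maps have finished for reduces to be scheduled. */
    static boolean mayScheduleReduces(int finishedMaps, double slowstartRatio, int numMapTasks) {
        return finishedMaps >= mapsRequired(slowstartRatio, numMapTasks);
    }

    public static void main(String[] args) {
        int numMaps = 100;
        for (double ratio : new double[] {0.0, 0.5, 1.0}) {
            System.out.println(SLOWSTART_KEY + "=" + ratio
                    + " -> reduces may be scheduled after "
                    + mapsRequired(ratio, numMaps) + " of " + numMaps + " maps");
        }
    }
}
```

Note that even with a ratio of 0.0 this only lets the reduce tasks start copying map output earlier; as explained above, the actual reduce calls still wait until every map has finished and its output has been copied and sorted.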

All this is controlled by the job tracker, in the JobInProgress class, lines 775, 1610, 1664.

+2

Source: https://habr.com/ru/post/1482127/

