How to start sort and reduce in Hadoop before the shuffle completes for all mappers?

I understand from "When do reduce tasks start in Hadoop" that a reduce task in Hadoop consists of three steps: shuffle, sort, and reduce, where the sort (and after it the reduce) can only start once all the mappers are done. Is there a way to start the sort and reduce every time a mapper finishes?

For example, say we have only one job with mappers mapperA and mapperB and 2 reducers. What I want to do is:

  • mapperA finishes
  • shuffle copies the appropriate partitions of mapperA's output, say to reducers 1 and 2
  • sort on reducers 1 and 2 starts sorting and reducing and generates some intermediate output
  • now mapperB finishes
  • shuffle copies the appropriate partitions of mapperB's output to reducers 1 and 2
  • sort and reduce on reducers 1 and 2 starts again, and the reducer merges the new output with the old

Is it possible? Thanks

+4
3 answers

You cannot with the current implementation. However, people have "hacked" the Hadoop code to do what you want to do.

In the MapReduce model, you need to wait for all mappers to finish, since the keys need to be grouped and sorted; plus, you may have some speculative mappers running, and you don't know yet which of the duplicate mappers will finish first.

However, as the "Breaking the MapReduce Stage Barrier" paper indicates, for some applications it may make sense not to wait for all of the mappers to finish. If you want to implement this behavior (most likely for research purposes), then you should take a look at the class org.apache.hadoop.mapred.ReduceTask.ReduceCopier, which implements ShuffleConsumerPlugin.

EDIT: Finally, as @teo points out in this related SO question,

the ReduceCopier.fetchOutputs() method is the one that holds the reduce task back from running until all map outputs are copied (through a loop in line 2026 of the Hadoop 1.0.4 release).

+3

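You can configure this using the slowstart property, which sets the fraction of your mappers that need to finish before the copy to the reducers starts. Stock Apache Hadoop defaults to 0.05 (5%), although many clusters are configured much higher, around 0.9 - 0.95 (90-95%); you can override the value to 0 if you want. Note that this only controls when the copy (shuffle) begins; the final sort and reduce still wait for all map outputs. The property is named `mapred.reduce.slowstart.completed.maps` in Hadoop 1.x and, in Hadoop 2 and later:

 `mapreduce.job.reduce.slowstart.completedmaps` 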
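
To make the effect of the slowstart fraction concrete, here is a minimal Java sketch. It is a simplified model written for this answer, not Hadoop's scheduler code: the class `SlowStartModel` and its method names are hypothetical, and the rounding rule (ceiling of fraction times number of maps) is an assumption about how the threshold is computed. It shows that lowering the fraction lets reducers begin copying earlier, while reduce() itself still waits for every map output.

```java
// Simplified, hypothetical model of the slowstart check; not Hadoop's actual code.
final class SlowStartModel {

    /** Number of maps that must finish before reducers may be launched (assumed: ceiling). */
    static int mapsRequired(int totalMaps, double slowstart) {
        return (int) Math.ceil(slowstart * totalMaps);
    }

    /** True once enough maps are done for reducers to start copying map output. */
    static boolean reducersMayStartCopying(int completedMaps, int totalMaps, double slowstart) {
        return completedMaps >= mapsRequired(totalMaps, slowstart);
    }

    /** reduce() is gated on all map outputs having been copied, whatever the slowstart value. */
    static boolean reduceMayRun(int copiedMapOutputs, int totalMaps) {
        return copiedMapOutputs == totalMaps;
    }

    public static void main(String[] args) {
        int totalMaps = 2; // mapperA and mapperB from the question

        // With slowstart = 0 the reducers can be launched before any map has finished.
        System.out.println(reducersMayStartCopying(0, totalMaps, 0.0));

        // With slowstart = 0.95 one finished map out of two is not enough.
        System.out.println(reducersMayStartCopying(1, totalMaps, 0.95));

        // Even when copying has started, reduce() waits for both map outputs.
        System.out.println(reduceMayRun(1, totalMaps));
    }
}
```

So setting the value to 0 overlaps the copy with the map phase, but it does not give you the per-mapper sort and reduce described in the question.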
+2

Starting the sort process before all the mappers finish is a kind of Hadoop antipattern (if I may put it that way!), in that the reducers cannot know that there is no more data to receive until all the mappers have finished. You, the invoker, may know that, based on your definition of keys, partitioner, etc., but the reducers do not.
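
As a rough illustration of that point, the sketch below folds each mapper's output into running per-key sums as it arrives. Everything here is hypothetical and plain Java, not a Hadoop API: the class `IncrementalSumReducer`, its methods, and the choice of summation are assumptions made for the example. Incremental merging like this only works because addition is associative and commutative, and even then the reducer cannot call its result final until it has heard from every mapper.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not a Hadoop API: merges map outputs one mapper at a time.
final class IncrementalSumReducer {
    private final Map<String, Long> partial = new HashMap<>();
    private final int expectedMappers;
    private int mappersSeen = 0;

    IncrementalSumReducer(int expectedMappers) {
        this.expectedMappers = expectedMappers;
    }

    /** Merge one mapper's partition for this reducer into the running per-key sums. */
    void acceptMapOutput(Map<String, Long> mapOutput) {
        mapOutput.forEach((key, value) -> partial.merge(key, value, Long::sum));
        mappersSeen++;
    }

    /** The sums are only final once every expected mapper has delivered its output. */
    boolean isFinal() {
        return mappersSeen == expectedMappers;
    }

    /** Snapshot of the intermediate (or final) result. */
    Map<String, Long> currentResult() {
        return new HashMap<>(partial);
    }

    public static void main(String[] args) {
        IncrementalSumReducer reducer = new IncrementalSumReducer(2);

        reducer.acceptMapOutput(Map.of("a", 1L, "b", 2L)); // mapperA finishes
        System.out.println(reducer.currentResult() + " final=" + reducer.isFinal());

        reducer.acceptMapOutput(Map.of("a", 3L));          // mapperB finishes
        System.out.println(reducer.currentResult() + " final=" + reducer.isFinal());
    }
}
```

The intermediate result after mapperA is usable only if the consumer tolerates it changing later, which is the knowledge the invoker has and a generic reducer lacks.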

+1

Source: https://habr.com/ru/post/1482125/