How to guarantee that the combiner runs at least once in MapReduce?

From some articles, I know that the combiner can run on both the map side and the reduce side, and that it may run anywhere from 0 to N times. I also know that a MapReduce program should produce the same result regardless of whether the combiner is called.

But I have one special situation that requires the combiner to be called at least once. Does anyone know how to ensure that?

P.S. In MapTask.java I saw these lines:

    if (null == combinerClass || numSpills < minSpillsForCombine) {
        Merger.writeFile(kvIter, writer, reporter);
    } else {
        combineCollector.setWriter(writer);
        combineAndSpill(kvIter, combineInputCounter);
    }

If I set minSpillsForCombine to zero, can I be sure that the combiner will be called at least once?

Thanks a lot!

2 answers

If you need the combiner to run at least once, you are using the combiner incorrectly. Its role is strictly optional: it pre-aggregates values using an operation that is associative and commutative. If you explain why you need this, perhaps someone can suggest a better design.
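To illustrate why an associative and commutative combiner can safely run any number of times, here is a minimal plain-Java sketch. It does not use the Hadoop API; the class name `CombinerOptionalDemo`, the helper methods, and the chunk size of two are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: simulates a sum combiner outside of Hadoop.
class CombinerOptionalDemo {

    // Sum is associative and commutative, so it is safe to use as a combiner.
    // Like a real combiner, it emits values of the same type it consumes.
    static List<Long> sumCombine(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        List<Long> out = new ArrayList<>();
        out.add(total);
        return out;
    }

    // The reducer applies the same sum and returns the final value.
    static long sumReduce(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Runs the combiner the given number of times over chunks of two values
    // (an arbitrary choice that mimics partial combining), then reduces.
    static long runJob(List<Long> mapOutput, int combinerPasses) {
        List<Long> current = new ArrayList<>(mapOutput);
        for (int pass = 0; pass < combinerPasses; pass++) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += 2) {
                int end = Math.min(i + 2, current.size());
                next.addAll(sumCombine(current.subList(i, end)));
            }
            current = next;
        }
        return sumReduce(current);
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(1L, 2L, 3L, 4L, 5L);
        for (int passes = 0; passes <= 3; passes++) {
            // Every line reports the same total, 15.
            System.out.println(passes + " combiner passes -> " + runJob(values, passes));
        }
    }
}
```

If a job only produces the right answer when `combinerPasses` is at least one, the logic placed in the combiner probably belongs in the mapper or the reducer instead.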

  • During a spill, before the data is written to disk, the background thread first divides the data into partitions corresponding to the reducers to which they will ultimately be sent.
  • Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
  • If there are at least three spill files, the combiner runs again before the output file is written.
  • You can change this magic number 3 by overriding the property mapreduce.map.combine.minspills.
  • The combiner can be run multiple times over the input without affecting the final result.
  • If there are only one or two spills, the potential reduction in the size of the map output is not worth the overhead of invoking the combiner.
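The merge-time threshold described above can be sketched as a small plain-Java model of the condition quoted in the question. This is not Hadoop code: the class `MergeCombineModel` and the method `combinerRunsDuringMerge` are hypothetical names, and the model covers only that one `if` condition, not the rest of the map task.

```java
// Hypothetical model of the merge-time check quoted from MapTask in the question.
class MergeCombineModel {

    // Default threshold mentioned in the answer (the "magic number 3").
    static final int DEFAULT_MIN_SPILLS_FOR_COMBINE = 3;

    // Negation of: null == combinerClass || numSpills < minSpillsForCombine.
    // Returns true when the quoted code would take the combine branch.
    static boolean combinerRunsDuringMerge(boolean combinerConfigured,
                                           int numSpills,
                                           int minSpillsForCombine) {
        return combinerConfigured && numSpills >= minSpillsForCombine;
    }

    public static void main(String[] args) {
        for (int numSpills = 1; numSpills <= 4; numSpills++) {
            System.out.println("spills=" + numSpills
                + " default=" + combinerRunsDuringMerge(true, numSpills, DEFAULT_MIN_SPILLS_FOR_COMBINE)
                + " threshold0=" + combinerRunsDuringMerge(true, numSpills, 0));
        }
    }
}
```

With a threshold of zero the quoted condition always selects the combine branch when a combiner is configured. The sketch says nothing about whether that branch is reached for every map task, so it should not be read as a guarantee that the combiner runs.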

Hope this helps.


Source: https://habr.com/ru/post/1493345/

