Where does the combiner combine the map output: in the map phase or in the reduce phase of a MapReduce job?

My impression was that combiners are like reducers that act on the output of a local map task; that is, a combiner aggregates the results of an individual map task in order to reduce the network bandwidth needed to transfer the output.

And from reading Hadoop: The Definitive Guide, 3rd edition, my understanding seems to be right.

From chapter 2 (p. 34)

Combiner Functions: Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
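That last requirement is why the classic word count reducer can double as a combiner: summing partial counts zero, one, or many times yields the same totals. A rough sketch of such a sum reducer (the class name here is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();          // add up the partial counts for this word
        }
        result.set(sum);
        context.write(key, result);      // emit (word, combined count)
    }
}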

So, I tried the following on the wordcount problem:

 job.setMapperClass(mapperClass);
 job.setCombinerClass(reduceClass);
 job.setNumReduceTasks(0);

Here are the counters:

 14/07/18 10:40:15 INFO mapred.JobClient: Counters: 10
 14/07/18 10:40:15 INFO mapred.JobClient: File System Counters
 14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of bytes read=293
 14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of bytes written=75964
 14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of read operations=0
 14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of large read operations=0
 14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of write operations=0
 14/07/18 10:40:15 INFO mapred.JobClient: Map-Reduce Framework
 14/07/18 10:40:15 INFO mapred.JobClient: Map input records=7
 14/07/18 10:40:15 INFO mapred.JobClient: Map output records=16
 14/07/18 10:40:15 INFO mapred.JobClient: Input split bytes=125
 14/07/18 10:40:15 INFO mapred.JobClient: Spilled Records=0
 14/07/18 10:40:15 INFO mapred.JobClient: Total committed heap usage (bytes)=85000192

and here is part-m-00000:

 hello 1
 world 1
 Hadoop 1
 programming 1
 mapreduce 1
 wordcount 1
 lets 1
 see 1
 if 1
 this 1
 works 1
 12345678 1
 hello 1
 world 1
 mapreduce 1
 wordcount 1

It is clear that the combiner is not run. I understand that Hadoop does not guarantee that a combiner will be called at all, but when I turn the reduce phase on, the combiner does get run.

WHY DOES IT BEHAVE THIS WAY?

I have now read chapter 6 (p. 208) on how MapReduce works, and I see this paragraph in the description of the reduce side:

The map outputs are copied to the reduce task's JVM memory if they are small enough (the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent) or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified, it will be run during the merge to reduce the amount of data written to disk.

My conclusion from this paragraph: the combiner ALSO runs during the reduce phase.
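As an aside, those three properties can be set like any other job configuration values; a quick sketch (the numbers here are only illustrative, not recommendations):

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // fraction of the reduce task's heap used to buffer copied map outputs
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
        // fill level of that buffer at which a merge and spill to disk is triggered
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
        // or merge once this many map outputs have accumulated in memory
        conf.setInt("mapred.inmem.merge.threshold", 1000);
        return conf;
    }
}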

2 answers

The main function of a combiner is optimization. In most cases it acts as a mini-reducer. From page 206 of the same book, in the chapter on how MapReduce works (the map side):

Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

And to quote from your question:

If a combiner is specified, it will be run during the merge to reduce the amount of data written to disk.

Both quotes indicate that the combiner is run primarily for compactness. A further advantage of this optimization is the reduced network bandwidth used when transferring the map output to the reducers.

Also from the same book:

Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, the potential reduction in the size of the map output is not worth the overhead of invoking the combiner, so it is not run again for this map output.

In other words, Hadoop does not guarantee how many times the combiner runs (it may not run at all).
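To see why that rule matters, here is a small stand-alone illustration (plain Java with made-up values, not Hadoop code): a summing combiner produces the same result no matter how many times it runs over partial groups of values, whereas an averaging one would not, so it could not be used as a combiner.

import java.util.stream.IntStream;

public class CombinerSafetyDemo {
    static int sum(int[] xs) { return IntStream.of(xs).sum(); }
    static double mean(int[] xs) { return (double) sum(xs) / xs.length; }

    public static void main(String[] args) {
        int[] spillA = {1, 3, 5};   // values for one key seen in the first spill
        int[] spillB = {7};         // value for the same key seen in a later spill
        int[] all = {1, 3, 5, 7};   // what a single pass over everything would see

        // Summing per spill and then summing the partial sums equals one big sum.
        System.out.println(sum(spillA) + sum(spillB));   // 16
        System.out.println(sum(all));                    // 16

        // Averaging per spill and then averaging the averages does NOT equal
        // the true average, so a mean-style combiner would change the result.
        System.out.println((mean(spillA) + mean(spillB)) / 2.0);  // 5.0
        System.out.println(mean(all));                            // 4.0
    }
}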

The combiner is never run for map-only jobs. This makes sense: in a map-only job the map output is the final output, and since the combiner changes the map output and the number of times it is called is not guaranteed, the final output would no longer be guaranteed to be the same.
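So to see the combiner take effect in your word count experiment, keep a reduce phase. Below is a minimal, self-contained sketch of such a job (class names are placeholders for your own mapperClass and reduceClass, and a single reducer is assumed):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);        // emit (word, 1) for each token
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                  // add up the partial counts
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);  // combiner: a mini reduce on the map output
        job.setReducerClass(SumReducer.class);
        job.setNumReduceTasks(1);                // reduce phase enabled, so the combiner may run
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}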

  • The combiner is not run if the job is map-only.

  • The combiner is run only if more than 3 spill files are written to disk.


Source: https://habr.com/ru/post/972731/

