My impression is that a combiner is like a reducer that runs locally on each map task; that is, it aggregates the output of an individual map task in order to reduce the network bandwidth needed to transfer that output.
And from reading Hadoop: The Definitive Guide, 3rd edition, my understanding seems right.
From chapter 2 (p. 34):

Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
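For concreteness, here is a minimal sketch (my own code, not from the book) of the kind of reducer that is safe to reuse as a combiner: summing is associative and commutative, so running it zero, one, or many times over partial map output does not change the final result.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each word; usable both as the reducer and as the combiner.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();  // accumulate partial counts for this word
        }
        result.set(sum);
        context.write(key, result);
    }
}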
So, I tried the following on the wordcount problem:
job.setMapperClass(mapperClass);
job.setCombinerClass(reduceClass);
job.setNumReduceTasks(0);
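(For context, the surrounding driver looks roughly like this; it is a sketch from memory, and the mapper/reducer class names are just my placeholders.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);  // placeholder: emits (word, 1) pairs
        job.setCombinerClass(IntSumReducer.class);  // the reducer class reused as combiner
        job.setNumReduceTasks(0);                   // map-only job, as in my experiment
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}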
Here are the counters:
14/07/18 10:40:15 INFO mapred.JobClient: Counters: 10
14/07/18 10:40:15 INFO mapred.JobClient:   File System Counters
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of bytes read=293
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of bytes written=75964
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of read operations=0
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of write operations=0
14/07/18 10:40:15 INFO mapred.JobClient:   Map-Reduce Framework
14/07/18 10:40:15 INFO mapred.JobClient:     Map input records=7
14/07/18 10:40:15 INFO mapred.JobClient:     Map output records=16
14/07/18 10:40:15 INFO mapred.JobClient:     Input split bytes=125
14/07/18 10:40:15 INFO mapred.JobClient:     Spilled Records=0
14/07/18 10:40:15 INFO mapred.JobClient:     Total committed heap usage (bytes)=85000192
and here is part-m-00000:
hello 1
world 1
Hadoop 1
programming 1
mapreduce 1
wordcount 1
lets 1
see 1
if 1
this 1
works 1
12345678 1
hello 1
world 1
mapreduce 1
wordcount 1
It's clear that the combiner is not being run. I understand that Hadoop makes no guarantee that the combiner will be called at all. But when I enable the reduce phase, the combiner does run.

Why does it behave this way?
I have now read chapter 6 (p. 208) on how MapReduce works, and I found this paragraph in the description of the reduce side:

Map outputs are copied to the reduce task JVM's memory if they are small enough (the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent) or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified, it will be run during the merge to reduce the amount of data written to disk.
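As an aside, the three properties mentioned in that paragraph can be set on the job configuration. A small sketch (the values shown are, as far as I remember, the defaults, not recommendations):

Configuration conf = new Configuration();
// Proportion of the reduce task's heap used to buffer copied map outputs
conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
// Fill level of that buffer at which an in-memory merge and spill is triggered
conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
// Number of buffered map outputs that also triggers a merge and spill
conf.setInt("mapred.inmem.merge.threshold", 1000);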
My conclusion from this paragraph: 1) the combiner ALSO runs during the reduce phase.