Only if the outputs for the two keys go to different reducers. If the partitioner decides that they should go to the same reducer, they will be in one file.
- Updated to include more details - Mostly from the book:
The partitioner just sorts the keys into buckets, numbered 0 to n - 1, where n is the number of reducers in your job. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. For a given job, the jobtracker knows the mapping between map outputs and hosts. A thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.
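To make the bucketing concrete, here is a minimal sketch of hash partitioning in Python. It is not Hadoop's actual code: `partition` and `bucket_keys` are hypothetical names, and `zlib.crc32` is only a stand-in for the key's hash code.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Return the reducer index (0 .. num_reducers - 1) for a key."""
    # crc32 stands in for the key's hash code (an assumption for this sketch);
    # it is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def bucket_keys(keys, num_reducers):
    """Group keys by the reducer that would receive them."""
    buckets = {i: [] for i in range(num_reducers)}
    for key in keys:
        buckets[partition(key, num_reducers)].append(key)
    return buckets
```

The point to notice is that the same key always lands in the same bucket, so every value for that key ends up at one reducer.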
The map outputs are copied to the reduce task JVM's memory if they are small enough (the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent) or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified, it will be run during the merge to reduce the amount of data written to disk.
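As a rough illustration of how those three properties interact, the following Python sketch computes the spill trigger. `ShuffleConfig` and `should_merge_and_spill` are hypothetical names, and the default values in the class are illustrative assumptions, so check your own cluster's configuration rather than relying on them.

```python
from dataclasses import dataclass


@dataclass
class ShuffleConfig:
    heap_bytes: int
    # The values below are illustrative assumptions, not authoritative defaults.
    input_buffer_percent: float = 0.70  # mapred.job.shuffle.input.buffer.percent
    merge_percent: float = 0.66         # mapred.job.shuffle.merge.percent
    inmem_merge_threshold: int = 1000   # mapred.inmem.merge.threshold

    @property
    def buffer_bytes(self) -> int:
        # Fraction of the reduce task's heap reserved for map outputs.
        return int(self.heap_bytes * self.input_buffer_percent)

    @property
    def merge_trigger_bytes(self) -> int:
        # Buffer usage at which an in-memory merge and spill starts.
        return int(self.buffer_bytes * self.merge_percent)


def should_merge_and_spill(cfg, bytes_in_buffer, outputs_in_buffer):
    """True when either the size threshold or the count threshold is reached."""
    return (bytes_in_buffer >= cfg.merge_trigger_bytes
            or outputs_in_buffer >= cfg.inmem_merge_threshold)
```

Either condition alone is enough to trigger the merge, which is why a job with many tiny map outputs can spill long before the buffer fills up.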
As the copies accumulate on disk, a background thread merges them into larger, sorted files. This saves some time merging later on. Note that any map outputs that were compressed (by the map task) have to be decompressed in memory in order to perform a merge on them.
When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10 (the default, controlled by the io.sort.factor property, just like in the map's merge), there would be five rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files.
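The round arithmetic above can be sketched in Python with `heapq.merge`. This follows the simplified description in this answer (fixed groups of `merge_factor` runs), not necessarily the exact grouping Hadoop picks; `merge_in_rounds` is a hypothetical name and in-memory lists stand in for files on disk.

```python
import heapq


def merge_in_rounds(sorted_runs, merge_factor=10):
    """Merge up to merge_factor sorted runs per round.

    Returns one intermediate sorted run per round.
    """
    intermediate = []
    for start in range(0, len(sorted_runs), merge_factor):
        group = sorted_runs[start:start + merge_factor]
        # heapq.merge does a k-way merge of inputs that are already sorted;
        # no re-sorting is needed, which is why this is really a merge phase.
        intermediate.append(list(heapq.merge(*group)))
    return intermediate
```

With 50 sorted runs and a merge factor of 10, the loop executes five times and returns five intermediate runs.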
Rather than have a final round that merges these five files into a single sorted file, the merge saves a trip to disk by directly feeding the reduce function in what is the last phase: the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.
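Here is a minimal Python sketch of that last step, assuming each segment is an already sorted sequence of (key, value) pairs. `reduce_from_segments` is a hypothetical name; the lazy generator plays the role of the final merge that is streamed straight into the reduce function instead of being written out first.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_from_segments(segments, reduce_fn):
    """Lazily merge sorted (key, value) segments and call reduce_fn once per key."""
    # heapq.merge returns an iterator, so the fully merged stream is never
    # materialised; segments may be in-memory lists or iterators over files.
    merged = heapq.merge(*segments, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, reduce_fn(key, (value for _, value in group))


# Usage: one "in-memory" and one "on-disk" segment, summed per key.
in_memory = [("a", 1), ("c", 1)]
on_disk = [("a", 2), ("b", 5)]
totals = list(reduce_from_segments([in_memory, on_disk], lambda k, vs: sum(vs)))
```

Because the merged stream is sorted by key, all values for a key arrive consecutively, so the reduce function can be called without holding the whole dataset in memory.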