I am reading Hadoop: The Definitive Guide, 3rd edition, by Tom White. It is a great resource for understanding the internals of Hadoop, especially MapReduce, which is the part that interests me.
From the book (page 205):
Shuffle and sort
MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle.
What I take from this is that the keys are sorted before they are sent to a reducer, which indicates that the output of the map phase of the job is sorted. Note: I deliberately do not say "mapper", since the map phase includes both the mapper (written by the programmer) and the built-in sorting mechanism of the MR framework.
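To pin down the terminology I use below, here is a minimal word-count sketch against the new (org.apache.hadoop.mapreduce) API; the class names are mine, not from the book. The mapper is only the user-written map() method, and the sorting described in the quote happens inside the framework between map() and reduce():

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // User-written part of the map phase: emits (word, 1) pairs in whatever order
    // the words appear; nothing in this class sorts anything.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // the framework sorts these by key later
                }
            }
        }
    }

    // The reducer sees each key once, with keys arriving in sorted order because of
    // the framework's shuffle and sort, not because of anything the mapper did.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }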
The map side
Each map task has a circular memory buffer that it writes the output to. The buffer is 100 MB by default, a size that can be tuned by changing the io.sort.mb property. When the contents of the buffer reach a certain threshold size (io.sort.spill.percent, which has the default 0.80, or 80%), a background thread starts to spill the contents to disk. Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete.
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
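For reference, the combiner mentioned here is simply a reducer-style class wired into the job. A hedged sketch of a driver, assuming the word-count classes above (the two-reducer setting is only an example, chosen so that each map output has two partitions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);
            // The combiner runs on the sorted output of each spill (and possibly again
            // while spills are merged); a sum-style reducer is safe as a combiner here
            // because addition is associative and commutative.
            job.setCombinerClass(WordCountReducer.class);
            job.setReducerClass(WordCountReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(2); // two reducers, so each map output has two partitions

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }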
My understanding of the paragraph above is that as the mapper produces key-value pairs, those pairs are partitioned and sorted. A hypothetical example:
Consider mapper-1 of a word-count program:
    mapper-1 contents
      partition-1
        xxxx: 2
        yyyy: 3
      partition-2
        aaaa: 15
        zzzz: 11
(Note that within each partition the data is sorted by key, but the data of partition-1 and the data of partition-2 do not have to form one contiguous sorted sequence when taken together.)
Reading on in the chapter:
Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there could be several spill files. Before the task is finished, the spill files are merged into **a single partitioned and sorted output file**. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10.
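The buffer and merge settings mentioned in these quotes are ordinary configuration properties. A sketch of tuning them, using the property names from this edition of the book (the values are only examples; newer Hadoop releases rename these properties, for example mapreduce.task.io.sort.mb):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ShuffleTuning {
        public static Job buildJob() throws Exception {
            Configuration conf = new Configuration();
            // Property names as used in this edition of the book.
            conf.setInt("io.sort.mb", 200);                // spill buffer size in MB (default 100)
            conf.setFloat("io.sort.spill.percent", 0.80f); // start spilling at 80% full (the default)
            conf.setInt("io.sort.factor", 10);             // max streams merged at once (the default)
            return Job.getInstance(conf, "word count");
        }
    }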
My understanding here is (note the bold phrase in the paragraph above, which is what misled me): as part of a single map task, several spill files may be written to disk, but they are merged into one file that is still partitioned and sorted. Consider the same example as above:
Before the map task finishes, its intermediate data could look like this:
    mapper-1 contents

    spill 1:          spill 2:          spill 3:
      partition-1       partition-1       partition-1
        hhhh: 5           xxxx: 2           xxxx: 3
        mmmm: 2           yyyy: 3           yyyy: 7
        yyyy: 9
      partition-2       partition-2       partition-2
        aaaa: 15          bbbb: 15          cccc: 15
        zzzz: 10          zzzz: 15          zzzz: 13
After the map task completes, the mapper's output will be a single file (note that the three spill files listed above are merged, and the combiner is not applied, since no combiner is specified for this job):
    mapper-1 contents
      partition-1:
        hhhh: 5
        mmmm: 2
        xxxx: 2
        xxxx: 3
        yyyy: 3
        yyyy: 7
        yyyy: 9
      partition-2:
        aaaa: 15
        bbbb: 15
        cccc: 15
        zzzz: 10
        zzzz: 15
        zzzz: 13
Therefore, partition-1 here may correspond to reducer-1. That is, the data in partition-1 above is sent to reducer-1, and the data in partition-2 is sent to reducer-2.
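Which keys land in which partition, and therefore which reducer, is decided by the job's Partitioner; by default this is HashPartitioner. A simplified sketch of its logic, shown only to make the partition-to-reducer mapping concrete (this is not something you normally need to write yourself):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Simplified sketch of what the default HashPartitioner does: every key whose
    // hash maps to partition i goes into partition i of every mapper's output file,
    // and all of those partition-i pieces are fetched by reducer i during the shuffle.
    public class SketchHashPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }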
If my understanding so far is correct:
- How can I get hold of that intermediate file, the mapper output file that contains both the partitions and the sorted data?
- It is interesting to note that running the mapper alone does not produce sorted output, which contradicts the points that the data sent to the reducer is not sorted (a small map-only sketch follows this list). More here
- Even the combiner is not applied if there is only a mapper and no reducer: Read more here
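Regarding the last two bullets, my reading (a hedged sketch, not something from the book): when a job has zero reduce tasks, the framework skips the whole shuffle, so there is no partitioning, no sort, and no combiner, and each mapper's raw output is written directly to the output directory. The partitioned, sorted intermediate file therefore only exists when there is at least one reducer. Assuming the word-count classes sketched earlier:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map only word tokens");
            job.setJarByClass(MapOnlyDriver.class);
            job.setMapperClass(WordCountMapper.class); // mapper from the earlier sketch
            // Zero reducers: the framework skips partitioning, sorting and the combiner,
            // and writes each mapper's output directly to the output directory.
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }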