In-depth understanding of the internal workings of the map phase of a MapReduce job in Hadoop?

I am reading Hadoop: The Definitive Guide, 3rd edition, by Tom White. It is a great resource for understanding the internals of Hadoop, especially MapReduce, which is what interests me.

From the book (page 205):

Shuffle and Sort

MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the shuffle.

What I infer from this is that the keys are sorted before being sent to the reducer, which indicates that the output of the map phase is sorted. Note: I say "map phase" rather than "mapper", since the map phase includes both the mapper (written by the programmer) and the MapReduce framework's built-in sorting mechanism.


The Map Side

Each map task has a circular memory buffer that it writes its output to. The buffer is 100 MB by default, a size that can be tuned by changing the io.sort.mb property. When the contents of the buffer reach a certain threshold size (io.sort.spill.percent, which has the default value 0.80, or 80%), a background thread will start to spill the contents to disk. Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete.
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
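The per-partition spill processing described above (sort by key, then an optional combine pass over the sorted records) can be sketched in plain Java. This is only an illustration of the idea, not Hadoop's actual implementation; the class and method names are invented:

```java
import java.util.*;

// Sketch of what happens to one partition's records at spill time:
// an in-memory sort by key, then, if a combiner is configured, combining
// the values of equal keys on the sorted output (word-count style).
public class SpillSortSketch {

    // Sort (key, count) records by key, then sum counts of adjacent equal keys.
    static List<Map.Entry<String, Integer>> sortAndCombine(List<Map.Entry<String, Integer>> records) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(records);
        sorted.sort(Map.Entry.comparingByKey());

        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> r : sorted) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).getKey().equals(r.getKey())) {
                // Equal keys are adjacent after the sort, so combining is a
                // single pass: fold this count into the previous entry.
                out.set(last, Map.entry(r.getKey(), out.get(last).getValue() + r.getValue()));
            } else {
                out.add(Map.entry(r.getKey(), r.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("yyyy", 1), Map.entry("xxxx", 2), Map.entry("yyyy", 2));
        System.out.println(sortAndCombine(records)); // [xxxx=2, yyyy=3]
    }
}
```

Note why the order matters: the combiner can run as a single linear pass only because the records have already been sorted, which is why the book says the combiner runs "on the output of the sort".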

My understanding of this passage is that as the mapper produces key-value pairs, they are partitioned, and the pairs within each partition are sorted. A hypothetical example:

Consider mapper-1 of a word-count program:

    mapper-1 contents
      partition-1
        xxxx: 2
        yyyy: 3
      partition-2
        aaaa: 15
        zzzz: 11

(Note that within each partition the data is sorted by key, but the data of partition-1 and the data of partition-2 need not be contiguous with each other.)
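As an aside, which partition a key lands in is decided by the job's partitioner; Hadoop's default HashPartitioner uses the key's hash modulo the number of reducers. A minimal plain-Java sketch of that rule (the class and method names here are illustrative, not the actual Hadoop API):

```java
// Sketch of how Hadoop's default HashPartitioner assigns a key to a
// partition (one partition per reducer). Illustrative only.
public class PartitionSketch {

    // Mirror of the HashPartitioner rule: mask off the sign bit so the
    // result is non-negative, then take the hash modulo the reducer count.
    static int getPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 2;
        for (String key : new String[] {"xxxx", "yyyy", "aaaa", "zzzz"}) {
            System.out.println(key + " -> partition " + getPartition(key, numReducers));
        }
    }
}
```

The important property is that the assignment is deterministic: every occurrence of the same key, in every mapper, lands in the same partition and therefore reaches the same reducer.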


Continuing reading the chapter:

Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10.

My understanding here (please note the bold phrase in the paragraph above, which is what threw me off): within a map task, several files may be spilled to disk, but they are merged into a single file, which is still partitioned and sorted. Consider the same example as above.

While a single map task is running, its intermediate data might be:

mapper-1 contents:

    spill 1:        spill 2:        spill 3:
    partition-1     partition-1     partition-1
    hhhh: 5         xxxx: 2         xxxx: 3
    mmmm: 2         yyyy: 3         yyyy: 7
    yyyy: 9
    partition-2     partition-2     partition-2
    aaaa: 15        bbbb: 15        cccc: 15
    zzzz: 10        zzzz: 15        zzzz: 13

After the map task completes, the mapper's output will be a single file (note that the three spill files listed above are merged, but the counts are not aggregated, since no combiner was specified for the job):

    mapper-1 contents:
      partition-1:
        hhhh: 5
        mmmm: 2
        xxxx: 2
        xxxx: 3
        yyyy: 3
        yyyy: 7
        yyyy: 9
      partition-2:
        aaaa: 15
        bbbb: 15
        cccc: 15
        zzzz: 10
        zzzz: 15
        zzzz: 13

Therefore, partition-1 here may correspond to reducer-1. That is, the data of partition-1 above is sent to reducer-1, and the data of partition-2 is sent to reducer-2.
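The merging of several sorted spill files into one sorted run per partition is essentially a k-way merge. Below is a plain-Java sketch under that assumption (not Hadoop's actual merge code), using the partition-1 data from the spills above:

```java
import java.util.*;

// Sketch of merging several sorted spill files for one partition into a
// single sorted run, as the map task does before it finishes. Each "spill"
// here is simply a pre-sorted list of (key, count) entries.
public class SpillMergeSketch {

    static List<Map.Entry<String, Integer>> merge(List<List<Map.Entry<String, Integer>>> spills) {
        // Min-heap of cursors {spill index, position in that spill},
        // ordered by the key the cursor currently points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] c) -> spills.get(c[0]).get(c[1]).getKey()));
        for (int i = 0; i < spills.size(); i++) {
            if (!spills.get(i).isEmpty()) heap.add(new int[] {i, 0});
        }

        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();                       // cursor with smallest key
            out.add(spills.get(top[0]).get(top[1]));
            if (top[1] + 1 < spills.get(top[0]).size()) {  // advance within that spill
                heap.add(new int[] {top[0], top[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // partition-1 of the three spills from the example above
        List<List<Map.Entry<String, Integer>>> spills = List.of(
                List.of(Map.entry("hhhh", 5), Map.entry("mmmm", 2), Map.entry("yyyy", 9)),
                List.of(Map.entry("xxxx", 2), Map.entry("yyyy", 3)),
                List.of(Map.entry("xxxx", 3), Map.entry("yyyy", 7)));
        System.out.println(merge(spills));
    }
}
```

Because each spill is already sorted, the merge never needs all of the data in memory at once, which is why io.sort.factor can bound how many streams are merged in one pass.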

If my understanding so far is correct:

  • How can I get hold of an intermediate file, with its partitions and sorted data, from the mapper's output?
  • Interestingly, running only a mapper does not produce sorted output, which seems to contradict the point that the map output is sorted before being sent to the reducer. More here
  • Similarly, the combiner is not applied if only the mapper runs, even when one is specified. Read more here
1 answer

Map-only jobs work differently from jobs with both a map and a reduce phase. It is not inconsistent, just different.

How can I get hold of an intermediate file, with its partitions and sorted data, from the mapper's output?

You cannot. It is not possible to get hold of data from MapReduce's intermediate steps. The same goes for getting data after the partitioner, or after the record reader, and so on.

Interestingly, running only a mapper does not produce sorted output, which seems to contradict the point that the map output is sorted before being sent to the reducer. More here

This is not a contradiction. Mappers sort because the reducer needs sorted input in order to do a merge. If there are no reducers, there is no reason to sort, so it doesn't happen. This is the correct behavior, because I would not want a map-only job to sort; that would slow down my processing. I have never had a situation where I wanted my map output to be locally sorted.

Similarly, the combiner is not applied if only the mapper runs, even when one is specified: more details here

Combiners are an optimization. There is no guarantee that they will actually run, or over what data. Combiners are mainly useful for making the reducers more efficient. So, again, just like the local sorting, combiners are not run if there are no reducers, because there is no reason to.

If you want combiner-like behavior, I suggest writing data to a buffer (perhaps a hashmap) and then writing out the locally aggregated data in a cleanup function that runs when the Mapper shuts down. Be careful about memory usage if you want to do this. This is a better approach, because combiners are specified as a best-effort optimization, and you cannot count on them running... even when they do run.
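A minimal plain-Java sketch of this "in-mapper combining" pattern for a word-count-style mapper. The map() and cleanup() methods only mirror the names of the Hadoop Mapper lifecycle methods; this is not the Hadoop API:

```java
import java.util.*;

// Sketch of in-mapper combining: aggregate counts in a HashMap inside the
// mapper, then emit the aggregated pairs once in a cleanup step, instead of
// emitting (word, 1) for every input word and hoping a combiner runs.
public class InMapperCombiner {
    final Map<String, Long> buffer = new HashMap<>();
    final List<String> emitted = new ArrayList<>();   // stand-in for context.write()

    // Called once per input word; aggregates locally instead of emitting.
    void map(String word) {
        buffer.merge(word, 1L, Long::sum);
        // In a real mapper you would also flush when buffer.size() exceeds
        // some bound, to keep memory usage in check (the caution above).
    }

    // Called once when the mapper shuts down; emits the aggregated pairs.
    void cleanup() {
        buffer.forEach((word, count) -> emitted.add(word + ":" + count));
    }

    public static void main(String[] args) {
        InMapperCombiner m = new InMapperCombiner();
        for (String w : new String[] {"aaaa", "zzzz", "aaaa", "aaaa"}) m.map(w);
        m.cleanup();
        System.out.println(m.emitted);
    }
}
```

Unlike a combiner, this aggregation is guaranteed to run exactly once per mapper, which is the point of the suggestion: the behavior becomes deterministic at the cost of managing the buffer's memory yourself.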


Source: https://habr.com/ru/post/972721/
