Regarding splitting map output data across reducers

Hadoop: The Definitive Guide (Tom White), "Shuffle and Sort" section, "The Map Side" subsection, just after Figure 6-4:

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.

Question:

Does this mean that the map writes each key's output to a different file and then combines the files later? So if two outputs with different keys were destined for the same reducer, would each key be sent to the reducer separately instead of as one file?

If my reasoning above is incorrect, what actually happens?

+4
3 answers

Only if the two keys' outputs go to different reducers. If the partitioner decides that they should go to the same reducer, they will be in the same file.
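
For illustration, here is a minimal, hedged sketch of that decision using Hadoop's default HashPartitioner (assuming the Hadoop client library is on the classpath; the keys and reducer count are made up for the demo). Keys assigned the same partition number land in the same partition of the map task's sorted output, destined for one reducer:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionDemo {
    public static void main(String[] args) {
        HashPartitioner<Text, Text> partitioner = new HashPartitioner<>();
        int numReducers = 2;
        // Keys that get the same partition number end up in the same
        // partition of the map output, i.e. the same file segment.
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            int part = partitioner.getPartition(new Text(key), new Text(""), numReducers);
            System.out.println(key + " -> partition " + part);
        }
    }
}
```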

Updated to include more detail, mostly from the book:

The partitioner just sorts the keys into buckets, numbered 0 to n-1 for the n reducers in your job. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. For a given job, the jobtracker knows the mapping between map outputs and hosts. A thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.

Map outputs are copied into the reduce task's JVM memory if they are small enough (the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent) or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified, it is run during the merge to reduce the amount of data written to disk.
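
These thresholds are ordinary job configuration properties. A hedged sketch of setting them with the old (pre-YARN) property names used in this answer; the values shown are illustrative, not tuning advice:

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Fraction of the reduce task's heap used to buffer map outputs.
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
        // Buffer fill level at which an in-memory merge/spill is triggered.
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
        // Number of buffered map outputs that also triggers a merge/spill.
        conf.setInt("mapred.inmem.merge.threshold", 1000);
    }
}
```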

As the copies accumulate on disk, a background thread merges them into larger, sorted files. This saves some time merging later on. Note that any map outputs that were compressed (by the map task) have to be decompressed in memory in order to perform a merge on them.
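
For context, map-output compression is itself just configuration. A hedged example with the old property names (the codec choice here is an example, not a recommendation):

```java
import org.apache.hadoop.conf.Configuration;

public class MapOutputCompression {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Compress intermediate map outputs; trades shuffle bandwidth for
        // the CPU cost of decompressing during the reduce-side merge.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.map.output.compression.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");
    }
}
```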

When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10 (the default, controlled by the io.sort.factor property, just as in the map's merge), there would be five rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files.
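
The round arithmetic is simple enough to check in a few lines. This is a toy illustration of the book's simplified description only; the real merger is cleverer about round sizes to minimize the data written to disk:

```java
public class MergeRounds {
    // Number of merge rounds if every round merges mergeFactor files into one.
    static int rounds(int files, int mergeFactor) {
        return (files + mergeFactor - 1) / mergeFactor; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(rounds(50, 10)); // 5 rounds -> 5 intermediate files
    }
}
```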

Rather than have a final round that merges these five files into a single sorted file, the merge saves a trip to disk by directly feeding the reduce function in what is the last phase: the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.

+4

If multiple reducers are configured, then during partitioning, keys destined for different reducers are stored in separate per-reducer files, and at the end of the map task each complete file is sent to its reducer, rather than one key at a time.

+1

Let's say you have 3 reducers running. You can then use the partitioner to decide which keys go to which of the three reducers. In the partitioner you could do something like X % 3 to decide which reducer a key goes to. Hadoop uses the HashPartitioner by default.
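
A hedged sketch of that X % 3 idea as a custom Partitioner (assuming IntWritable keys, which is an assumption of this example; the default HashPartitioner does essentially the same thing with hashCode()):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ModuloPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        // Mask off the sign bit so negative keys still map to a valid bucket;
        // with 3 reducers, numReduceTasks is 3 and this is the X % 3 above.
        return (key.get() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

It would be registered on the job with job.setPartitionerClass(ModuloPartitioner.class).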

+1
