Handling a very large key-value pair in Hadoop

I am new to Hadoop, and so far my programs have not gone much beyond WordCount-level complexity. I am trying to understand the fundamental architecture of Hadoop so that I can design my solutions better.

One of the big questions I have is: how does Hadoop handle large key-value pairs at block boundaries? Suppose I have a key-value pair that is 10 MB in size (for example, the value is an entire 10 MB file), and suppose I use a SequenceFile. How does Hadoop handle this at block boundaries? Does it split the pair into two parts and store them in two different blocks, or does it recognize that the key-value pair is very large and, instead of splitting it, start a new block for the whole pair?

1 answer

The default block size in HDFS is 64 MB. If a key-value pair is 10 MB, it may or may not be split across blocks, depending on where it starts. Two cases (with a small sketch of the arithmetic after them):

  • If the first KV pair is 60 MB and the second is 10 MB, then only 4 MB of space remains in the first block (with a 64 MB block size). So 4 MB of the second KV pair is stored in the first block and the remaining 6 MB in the second block.

  • If the first KV pair is 40 MB and the second is 10 MB, then 24 MB of space remains in the first block. So the second KV pair is stored entirely in the first block and is not split.
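A minimal sketch of that arithmetic in plain Java (no Hadoop dependencies): given a record's starting offset in the file and its length, it computes which 64 MB blocks the record spans. The offsets are simply the numbers from the two cases above.

```java
public class BlockSpan {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // default HDFS block size used in this answer

    static void report(String name, long startOffset, long length) {
        long firstBlock = startOffset / BLOCK_SIZE;
        long lastBlock = (startOffset + length - 1) / BLOCK_SIZE;
        System.out.printf("%s: starts in block %d, ends in block %d (%s)%n",
                name, firstBlock, lastBlock,
                firstBlock == lastBlock ? "not split" : "split across blocks");
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        // Case 1: 60 MB already written, so the next 10 MB record straddles the boundary
        report("second KV pair after 60 MB", 60 * mb, 10 * mb);
        // Case 2: 40 MB already written, so the next 10 MB record fits in the first block
        report("second KV pair after 40 MB", 40 * mb, 10 * mb);
    }
}
```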

When a SequenceFile is used, the mapper does not know where a record begins within a block, so sync markers are automatically added to SequenceFiles by the Hadoop framework. According to Hadoop: The Definitive Guide:

A sync point is a point in the stream that can be used to resynchronize with a record boundary if the reader is "lost", for example, after seeking to an arbitrary position in the stream. Sync points are recorded by SequenceFile.Writer, which inserts a special entry to mark the sync point every few records as a sequence file is being written. Such entries are small enough to incur only a modest storage overhead, less than 1%. Sync points always align with record boundaries.
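For reference, here is a minimal sketch of writing such a file, assuming Hadoop 2.x client libraries on the classpath; the output path, key/value types, and record sizes are invented for illustration. The writer adds the sync markers on its own as records are appended, so nothing special is needed for them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LargeValueWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/large-values.seq"); // hypothetical output path

        byte[] tenMb = new byte[10 * 1024 * 1024];     // one 10 MB value per record

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (int i = 0; i < 20; i++) {
                // With ~10 MB per record, some records will inevitably straddle
                // an HDFS block boundary; the sync markers make that safe to read.
                writer.append(new Text("file-" + i), new BytesWritable(tenMb));
            }
        }
    }
}
```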

When a map task starts processing a block, it seeks to the first sync point in that block and begins processing records from there. When it reaches the end of the block, it keeps reading up to the first sync point of the next block, and that extra data is transferred over the network to the mapper for processing.
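A minimal sketch of that reading pattern, using the same hypothetical file and invented split offsets; it mirrors what the framework's SequenceFileRecordReader does: sync to the first record boundary at or after the split start, then read whole records until the split end has been passed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SplitReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/large-values.seq");   // hypothetical input path
        long splitStart = 64L * 1024 * 1024;             // pretend this task owns the 2nd block
        long splitEnd   = 128L * 1024 * 1024;

        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            reader.sync(splitStart);                     // jump to the next sync point after splitStart
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            // Read every record that starts before splitEnd; a record that starts
            // in this block is read to completion even if its bytes continue into
            // the next block (those bytes may arrive over the network).
            while (reader.getPosition() < splitEnd && reader.next(key, value)) {
                System.out.println(key + " -> " + value.getLength() + " bytes");
            }
        }
    }
}
```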

To summarize, the Hadoop framework recognizes record boundaries even when a record is split across blocks, and handles the split correctly.

