Handling a very large key-value pair in Hadoop

I am new to Hadoop, and so far my programs have not gone much beyond WordCount-level complexity. I am trying to understand the fundamental architecture of Hadoop so that I can design my solutions better.

One of the big questions I have is: how does Hadoop handle large key-value pairs at block boundaries? Suppose I have a key-value pair that is 10 MB in size (for example, the value is an entire 10 MB file), and suppose I use a SequenceFile. How does Hadoop handle this at block boundaries? Does it split the pair into two parts and store them in two different blocks, or does it recognize that the key-value pair is very large and, instead of splitting it, start a new block for the whole pair?

1 answer

The default block size in HDFS is 64 MB. If a key-value pair is 10 MB, it may or may not be split across blocks, depending on where it starts. Two cases (with a small sketch of the arithmetic after them):

  • If the first KV pair is 60 MB and the second is 10 MB, then only 4 MB of space remains in the first block (with a 64 MB block size). So 4 MB of the second KV pair is stored in the first block and the remaining 6 MB in the second block.

  • If the first KV pair is 40 MB and the second is 10 MB, then 24 MB of space remains in the first block. So the second KV pair is stored entirely in the first block and is not split.
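A minimal sketch of that arithmetic in plain Java (no Hadoop dependencies): given a record's starting offset in the file and its length, it computes which 64 MB blocks the record spans. The offsets are simply the numbers from the two cases above.

```java
public class BlockSpan {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // default HDFS block size used in this answer

    static void report(String name, long startOffset, long length) {
        long firstBlock = startOffset / BLOCK_SIZE;
        long lastBlock = (startOffset + length - 1) / BLOCK_SIZE;
        System.out.printf("%s: starts in block %d, ends in block %d (%s)%n",
                name, firstBlock, lastBlock,
                firstBlock == lastBlock ? "not split" : "split across blocks");
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        // Case 1: 60 MB already written, so the next 10 MB record straddles the boundary
        report("second KV pair after 60 MB", 60 * mb, 10 * mb);
        // Case 2: 40 MB already written, so the next 10 MB record fits in the first block
        report("second KV pair after 40 MB", 40 * mb, 10 * mb);
    }
}
```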

When a SequenceFile is used, the mapper does not know where a record begins within a block, so sync markers are automatically added to SequenceFiles by the Hadoop framework. According to Hadoop: The Definitive Guide:

A sync point is a point in the stream that can be used to resynchronize with a record boundary if the reader is "lost", for example, after seeking to an arbitrary position in the stream. Sync points are recorded by SequenceFile.Writer, which inserts a special entry to mark the sync point every few records as a sequence file is being written. Such entries are small enough to incur only a modest storage overhead, less than 1%. Sync points always align with record boundaries.
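For reference, here is a minimal sketch of writing such a file, assuming Hadoop 2.x client libraries on the classpath; the output path, key/value types, and record sizes are invented for illustration. The writer adds the sync markers on its own as records are appended, so nothing special is needed for them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LargeValueWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/large-values.seq"); // hypothetical output path

        byte[] tenMb = new byte[10 * 1024 * 1024];     // one 10 MB value per record

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (int i = 0; i < 20; i++) {
                // With ~10 MB per record, some records will inevitably straddle
                // an HDFS block boundary; the sync markers make that safe to read.
                writer.append(new Text("file-" + i), new BytesWritable(tenMb));
            }
        }
    }
}
```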

When a map task starts processing a block, it seeks to the first sync point in that block and begins processing records from there. When it reaches the end of the block, it keeps reading up to the first sync point of the next block, and that extra data is transferred over the network to the mapper for processing.
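A minimal sketch of that reading pattern, using the same hypothetical file and invented split offsets; it mirrors what the framework's SequenceFileRecordReader does: sync to the first record boundary at or after the split start, then read whole records until the split end has been passed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SplitReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/large-values.seq");   // hypothetical input path
        long splitStart = 64L * 1024 * 1024;             // pretend this task owns the 2nd block
        long splitEnd   = 128L * 1024 * 1024;

        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            reader.sync(splitStart);                     // jump to the next sync point after splitStart
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            // Read every record that starts before splitEnd; a record that starts
            // in this block is read to completion even if its bytes continue into
            // the next block (those bytes may arrive over the network).
            while (reader.getPosition() < splitEnd && reader.next(key, value)) {
                System.out.println(key + " -> " + value.getLength() + " bytes");
            }
        }
    }
}
```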

To summarize, the Hadoop framework recognizes record boundaries even when a record is split across blocks, and handles the split correctly.

