The default HDFS block size is 64 MB (128 MB in Hadoop 2 and later). A 10 MB key/value pair may or may not be split across blocks, depending on where it falls:
If the first KV pair is 60 MB and the second is 10 MB, only 4 MB of space remain in the first 64 MB block. So 4 MB of the second KV pair are stored in the first block, and the remaining 6 MB spill into the second block.
If the first KV pair is 40 MB and the second is 10 MB, 24 MB of space remain in the first block, so the second KV pair is stored entirely in the first block and is not split.
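The arithmetic in both scenarios can be sketched in a few lines of Python. The helper `place_records` and the fixed 64 MB block size are illustrative only, not Hadoop APIs:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default, 64 MB
MB = 1024 * 1024

def place_records(record_sizes, block_size=BLOCK_SIZE):
    """For each record, return the (block_index, bytes_in_block) spans it
    occupies. Records are written back to back; a record that does not fit
    in the remaining space of the current block spills into the next one."""
    placements = []
    offset = 0  # absolute byte offset in the file
    for size in record_sizes:
        spans = []
        remaining = size
        while remaining > 0:
            block = offset // block_size
            room = block_size - (offset % block_size)
            used = min(room, remaining)
            spans.append((block, used))
            offset += used
            remaining -= used
        placements.append(spans)
    return placements

# Scenario 1: 60 MB then 10 MB -> the second record spans blocks 0 and 1
print(place_records([60 * MB, 10 * MB])[1])  # [(0, 4 MB), (1, 6 MB)]
# Scenario 2: 40 MB then 10 MB -> the second record fits entirely in block 0
print(place_records([40 * MB, 10 * MB])[1])  # [(0, 10 MB)]
```

Running it confirms the split: in the first scenario the second record is cut 4 MB / 6 MB across two blocks, in the second it stays whole.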
When using a SequenceFile, a mapper does not know where a record begins within its block, so sync markers are automatically added to SequenceFiles by the Hadoop framework. According to Hadoop: The Definitive Guide:
A sync point is a point in the stream that can be used to resynchronize with a record boundary if the reader is "lost" - for example, after seeking to an arbitrary position in the stream. Sync points are recorded by SequenceFile.Writer, which inserts a special entry to mark the sync point every few records as a sequence file is being written. Such entries are small enough to incur only a modest storage overhead - less than 1%. Sync points always align with record boundaries.
When a map task starts processing a block, it seeks to the first sync point at or after the start of the block and begins reading records there. When it reaches the end of the block, it keeps reading until the first sync point of the next block; those extra bytes are fetched over the network from the node holding that block.
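That seek-to-sync behavior can be simulated in plain Python. Here `SYNC` is a toy stand-in for the real 16-byte sync marker, and `split_for_block` is a hypothetical helper, not part of Hadoop - it only illustrates how adjacent readers end up with non-overlapping, complete records:

```python
SYNC = b"\x00SYNC\x00"  # toy stand-in for SequenceFile's 16-byte sync marker

def split_for_block(data, start, end):
    """Return the slice a reader assigned the byte range [start, end) would
    actually process: from the first sync marker at or after `start` up to
    the first sync marker at or after `end` (or end of file)."""
    begin = data.find(SYNC, start)
    if begin < 0:
        return b""  # no record starts in or after this range
    stop = data.find(SYNC, end)
    if stop < 0:
        stop = len(data)
    return data[begin:stop]

# A tiny "file": three records, each preceded by a sync marker.
data = SYNC + b"record1" + SYNC + b"record2" + SYNC + b"record3"
# Pretend the block boundary falls in the middle of record2.
mid = data.find(b"record2") + 3
part1 = split_for_block(data, 0, mid)    # reads past `mid` to the next sync
part2 = split_for_block(data, mid, len(data))  # skips ahead to that sync
assert part1 + part2 == data  # no bytes lost, no bytes read twice
```

The first reader processes record2 in full even though the boundary cuts it, and the second reader starts cleanly at record3 - which is exactly why a record split across HDFS blocks is still handled correctly.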
To summarize, the Hadoop framework recognizes record boundaries even when a record is split across blocks.