Splitting a SequenceFile in a controlled manner - Hadoop

Hadoop writes to a SequenceFile in key-value pair (record) format. Suppose we have a large, unbounded log file. Hadoop will split the file based on the block size and store it on multiple data nodes. Is it guaranteed that each key-value pair will reside in a single block? Or can the key end up in one block on node 1 and the value (or part of it) in a second block on node 2? If such meaningless splits are possible, what is the solution? Sync markers?

Another question: does Hadoop write sync markers automatically, or do we have to write them manually?

1 answer

I asked this question on the mailing list. They said:

Sync markers are already written into sequence files; they are part of the format. This is nothing to worry about, and it is simple enough to test and confirm. The mechanism is the same as reading a text file with newlines: the reader will read past the split boundary to complete a record, if necessary.
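To make the newline analogy concrete, here is a minimal Python sketch of how a line-oriented reader can process byte ranges independently without ever breaking a line. It is a toy model, not Hadoop's actual reader code: the function name `read_lines_in_split`, the 50-byte `block_size`, and the sample log are all made up for illustration. The assumed rule is that every split except the first discards everything up to its first newline, and every split reads on past its end to finish the last line it started.

```python
def read_lines_in_split(data, start, end):
    """Return the complete lines owned by the byte range [start, end]."""
    pos = start
    if start != 0:
        # Not the first split: the bytes up to the first newline belong to
        # a line that the previous split finishes, so skip past them.
        newline = data.find(b"\n", start)
        if newline < 0:
            return []
        pos = newline + 1
    lines = []
    # A line is owned by this split if it starts at or before `end`;
    # it is read to completion even when it runs past the boundary.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline < 0:
            newline = len(data)  # last line without a trailing newline
        lines.append(data[pos:newline])
        pos = newline + 1
    return lines


# Hypothetical log and block size, chosen so boundaries fall mid-line.
log = b"".join(f"event {i}: some log text\n".encode() for i in range(20))
block_size = 50
splits = [(s, min(s + block_size, len(log))) for s in range(0, len(log), block_size)]

per_split = [read_lines_in_split(log, s, e) for s, e in splits]
recovered = [line for lines in per_split for line in lines]

# Every line is seen exactly once, whole, and in order.
assert recovered == log.splitlines()
```

Each split can be handled by a different worker, yet no line is lost, duplicated, or cut in half.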

Then I asked:

So if we have a map task that analyzes only the second block of the log file, it should not need to transfer any other parts from other nodes, because that part stands alone as a complete, meaningful split? Am I right?

They said:

Yes. Simply put, your records will never be broken. We do not read only up to the split boundaries; we may read beyond a boundary until a sync marker is encountered, in order to complete a record or series of records. Subsequent mappers will always skip to their first sync marker and then start reading, to avoid duplication. This is exactly how text file reading works as well, only there the marker is a newline.
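The rule described in this reply can be sketched with a toy binary format in Python. This is an illustration under stated assumptions, not the real SequenceFile layout or Hadoop's reader: the names `write_records` and `read_split`, the marker bytes, the 40-byte sync interval, and the 64-byte block size are all hypothetical. Records are length-prefixed, a length of -1 announces a sync marker, and the payloads are assumed never to contain the marker bytes.

```python
import struct

SYNC = bytes(range(240, 256))      # toy 16-byte sync marker
ESCAPE = struct.pack(">i", -1)     # a "length" of -1 announces a marker


def write_records(records, sync_interval):
    """Serialize records, adding a sync marker about every sync_interval bytes."""
    out = bytearray()
    last_sync = 0
    for rec in records:
        if len(out) - last_sync >= sync_interval:
            out += ESCAPE + SYNC   # the writer adds markers on its own
            last_sync = len(out)
        out += struct.pack(">i", len(rec)) + rec
    return bytes(out)


def read_split(data, start, end):
    """Return the whole records owned by the byte range [start, end)."""
    if start == 0:
        pos = 0
    else:
        # Later splits skip ahead to the first marker that begins at or
        # after `start`; earlier bytes are read by the previous split.
        hit = data.find(SYNC, start + 4)
        if hit < 0:
            return []
        pos = hit - 4              # back up to the escape before the marker
    records = []
    while pos < len(data):
        (length,) = struct.unpack_from(">i", data, pos)
        if length == -1:
            if pos >= end:
                break              # this marker belongs to the next split
            pos += 4 + len(SYNC)
            continue
        payload = pos + 4
        # Records are always read whole, even past the split boundary.
        records.append(data[payload:payload + length])
        pos = payload + length
    return records


records = [f"key{i}\tvalue{i}".encode() for i in range(50)]
data = write_records(records, sync_interval=40)
block_size = 64
splits = [(s, min(s + block_size, len(data))) for s in range(0, len(data), block_size)]
recovered = [r for s, e in splits for r in read_split(data, s, e)]

# No record is broken, skipped, or read twice.
assert recovered == records
```

In this sketch a reader may pull a few bytes from the following block to finish its last group of records, which is the "reading beyond the boundary" the reply mentions; the next reader compensates by starting at its first sync marker.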


Source: https://habr.com/ru/post/903225/

