Splitting a SequenceFile in a controlled manner - Hadoop

Question


Hadoop writes to a SequenceFile as key-value pairs (records). Suppose we have a large, unbounded log file. Hadoop will split the file based on the block size and store the blocks on multiple data nodes. Is it guaranteed that each key-value pair sits entirely within one block? Or can the key end up in one block on node 1 and the value (or part of it) in a second block on node 2? If records can be broken across blocks like that, what is the solution? Sync markers?

Another question: does Hadoop write sync markers automatically, or do we have to write them manually?


hadoop

Majid azimi Dec 6 '11 at 19:32


1 answer

Majid azimi · Accepted Answer · 2011-12-06T21:11:26+0000

I asked this question on the mailing list. They said:

Sync markers are already written into sequence files; they are part of the format. This is nothing to worry about, and it is simple enough to test and be sure. The mechanism is the same as reading a text file with newlines: the reader will read past the split boundary to complete a record, if necessary.
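The text-file analogy in that reply can be made concrete with a small simulation. This is only a sketch in plain Python, not Hadoop code: `read_split_lines` is a hypothetical helper, and the rule it uses (a split that does not start at offset 0 discards everything up to and including its first newline, and every split keeps reading while a line starts at or before its end offset) is an assumption modelled on the behaviour described above.

```python
def read_split_lines(data: bytes, start: int, end: int) -> list:
    """Return the lines owned by the byte range [start, end) of `data`.

    Hypothetical helper: a toy model of split-aware line reading.
    """
    pos = start
    if start != 0:
        # Skip the (possibly partial) first line; the previous split
        # reads past its own end to finish that line.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1

    lines = []
    # A line is owned by this split if it starts at or before `end`,
    # even when its bytes run past the boundary.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


if __name__ == "__main__":
    log = b"alpha\nbravo\ncharlie\ndelta\n"
    split_size = 8  # arbitrary; deliberately cuts lines in the middle
    for start in range(0, len(log), split_size):
        end = min(start + split_size, len(log))
        print((start, end), read_split_lines(log, start, end))
```

Whatever split size is chosen, every line is returned whole by exactly one split, even though the byte boundaries fall in the middle of lines.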

Then I asked:

So if a map task processes only the second block of the log file, it should not need to transfer any other parts from other nodes, because that block is independent and forms a complete split? Am I right?

They said:

Yes. Simply put, your records will never break. We do not read just up to the split boundary; we may read beyond it until a sync marker is encountered, in order to complete a record or series of records. Subsequent mappers will always skip ahead to their first sync marker and start reading from there, to avoid duplication. This is exactly how reading a text file works, except that there the marker is a newline.
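The same idea for binary records with sync markers can be sketched as below. This is a toy format in plain Python, not the real SequenceFile layout or Hadoop's reader: `SYNC`, `ESCAPE`, `SYNC_INTERVAL`, `write_records` and `read_split` are all hypothetical names, the marker value and interval are made up, and the ownership rule (a split reads every group of records whose preceding marker begins inside that split) is a simplifying assumption. The sketch also assumes payloads never contain the marker bytes.

```python
import struct

SYNC = bytes(range(0xF0, 0x100))   # hypothetical 16-byte sync marker
ESCAPE = struct.pack(">i", -1)     # a record length of -1 announces a marker
SYNC_INTERVAL = 40                 # hypothetical; tiny so the demo has many markers


def write_records(records):
    """Serialize length-prefixed records, inserting a marker periodically."""
    out = bytearray()
    last_sync = 0
    for payload in records:
        if len(out) - last_sync >= SYNC_INTERVAL:
            out += ESCAPE + SYNC          # the writer adds markers by itself
            last_sync = len(out)
        out += struct.pack(">i", len(payload)) + payload
    return bytes(out)


def read_split(data, start, end):
    """Return the records owned by the byte range [start, end)."""
    if start == 0:
        pos = 0
    else:
        # Skip ahead to the first marker at or after `start`.
        marker = data.find(SYNC, start)
        if marker == -1 or marker >= end:
            return []                     # no marker begins inside this split
        pos = marker + len(SYNC)

    records = []
    while pos < len(data):
        (length,) = struct.unpack_from(">i", data, pos)
        pos += 4
        if length == -1:                  # marker bytes begin at `pos`
            if pos >= end:
                break                     # the next split starts from this marker
            pos += len(SYNC)
            continue
        # May run past `end` so the current record is always completed.
        records.append(data[pos:pos + length])
        pos += length
    return records


if __name__ == "__main__":
    blob = write_records([f"record-{i:03d}".encode() for i in range(20)])
    block = 64  # arbitrary stand-in for a block size
    for start in range(0, len(blob), block):
        end = min(start + block, len(blob))
        print((start, end), read_split(blob, start, end))
```

Each reader may consume bytes that physically live in the next block, but no record is lost or read twice, which matches the behaviour the reply describes.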
