Hadoop writes to a SequenceFile as key-value pairs (records). Suppose we have a very large, continuously growing log file. Hadoop will split the file based on the block size and store the blocks on multiple data nodes. Is each key-value pair guaranteed to sit entirely within a single block? Or can the key end up in one block on node 1 while the value (or part of it) lands in a second block on node 2? If records can be split across blocks this way, what is the solution? Sync markers?
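To make the scenario concrete, here is a toy sketch in plain Python (not Hadoop code) of what I am worried about. It assumes the file is cut purely by byte offset, with no regard for record boundaries. The record layout, `BLOCK_SIZE`, and the names `encode_record` and `straddling_records` are my own inventions for illustration, not the real SequenceFile format.

```python
import struct

BLOCK_SIZE = 64  # toy block size in bytes; real HDFS blocks are far larger


def encode_record(key: bytes, value: bytes) -> bytes:
    """Length-prefixed key and value: a simplified stand-in for one record."""
    return struct.pack(">II", len(key), len(value)) + key + value


def straddling_records(records, block_size=BLOCK_SIZE):
    """Return (index, first_block, last_block) for records crossing a block boundary."""
    offset = 0
    crossing = []
    for i, (key, value) in enumerate(records):
        size = len(encode_record(key, value))
        first_block = offset // block_size               # block holding the first byte
        last_block = (offset + size - 1) // block_size   # block holding the last byte
        if first_block != last_block:
            crossing.append((i, first_block, last_block))
        offset += size
    return crossing


# Six records of 37 bytes each (8-byte header + 4-byte key + 25-byte value).
records = [(f"key{i}".encode(), b"x" * 25) for i in range(6)]
crossing = straddling_records(records)
for i, first, last in crossing:
    print(f"record {i} starts in block {first} and ends in block {last}")
```

In this toy model, any record whose size does not divide the block size evenly will eventually straddle a boundary, which is exactly the case I am asking about.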
Another question: Does Hadoop write the sync markers automatically, or do we have to write them manually?