You will need to write your own input format and record reader to ensure that files are split correctly around your record delimiter.
Basically, your record reader will need to seek to its split's byte offset and scan forward (reading lines) until it finds either:
- the Begin ... line, in which case it reads lines up to the next end ... line and provides the lines between the begin and end markers as the input for the next record, or
- the end of the split (it scans past it) or EOF, in which case there are no more records for this split.
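The scan described above can be sketched in plain Java, without any Hadoop classes, to show the split-boundary logic in isolation. This is only an illustration under stated assumptions: the class name `BeginEndScanner`, the marker prefixes `Begin` and `end`, and the in-memory byte array standing in for the file are all hypothetical, and it assumes no line inside a record starts with either marker.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Hadoop-free sketch of the scanning logic a custom record reader could use.
class BeginEndScanner {
    static final String BEGIN = "Begin"; // assumed start-of-record line prefix
    static final String END = "end";     // assumed end-of-record line prefix

    private final byte[] data; // stands in for the file contents
    private int pos;           // current read position, always at the start of a line

    BeginEndScanner(byte[] data) {
        this.data = data;
    }

    // Returns the line starting at pos (without its newline) and moves pos past it; null at EOF.
    private String nextLine() {
        if (pos >= data.length) {
            return null;
        }
        int lineStart = pos;
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        String line = new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8);
        if (pos < data.length) {
            pos++; // consume the newline
        }
        return line;
    }

    // Returns the body of every record whose Begin line starts inside [splitStart, splitEnd).
    // Assumes 0 <= splitStart <= data.length.
    List<String> readSplit(int splitStart, int splitEnd) {
        List<String> records = new ArrayList<>();
        pos = splitStart;
        // If the split starts mid-line, that line belongs to the previous split: skip its remainder.
        if (pos > 0 && data[pos - 1] != '\n') {
            nextLine();
        }
        while (pos < splitEnd) {
            String line = nextLine();
            if (line == null) {
                break; // EOF
            }
            if (!line.startsWith(BEGIN)) {
                continue; // tail of a record owned by the previous split, or a stray line
            }
            List<String> body = new ArrayList<>();
            boolean closed = false;
            // Deliberately reads past splitEnd if needed, so a record is never cut in half.
            while ((line = nextLine()) != null) {
                if (line.startsWith(END)) {
                    closed = true;
                    break;
                }
                body.add(line);
            }
            if (closed) {
                records.add(String.join("\n", body)); // an unterminated record at EOF is dropped
            }
        }
        return records;
    }
}
```

Because a record is owned by the split in which its Begin line starts, two adjacent splits never emit the same record, even when the boundary falls in the middle of one. In a real job the same loop would presumably live in a `RecordReader` subclass, driven by the `getStart()` and `getLength()` values of the `FileSplit` and reading from the opened file stream rather than a byte array.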
This is similar in algorithm to how Mahout's XMLInputFormat handles multiline XML as input. In fact, you might be able to adapt its source code directly to handle your situation.
As mentioned in @irW's answer, NLineInputFormat is another option if your records have a fixed number of lines per record, but it is really inefficient for large files, since it has to open and read the entire file to discover the line offsets in the input format's getSplits() method.