How to handle multi-line records across InputSplits?

I have a 100 TB text file that contains multi-line records. We are not told how many lines each record takes: one record can be 5 lines, another 6 lines, another 4 lines. The number of lines may differ from record to record.

Therefore, I cannot use the standard TextInputFormat. I wrote my own InputFormat and a custom RecordReader, but my confusion is this: when the file is split, I am not sure that each split will contain only complete records. Part of a record may end up in split 1 and the rest in split 2, which is wrong.

So, can you suggest how to handle this scenario so that each full record is guaranteed to go into a single InputSplit?

Thanks in advance -JE

+4
2 answers

You need to know whether the records are actually delimited by some known sequence of characters.

If they are, you can set the textinputformat.record.delimiter configuration property so that TextInputFormat splits records on that sequence instead of on newlines.
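As a rough sketch of that configuration, assuming a Hadoop release whose TextInputFormat (new mapreduce API) reads this property. The class name, the job name, and the "END" marker line are made up for illustration; the real delimiter depends on your data.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

class DelimitedRecordsJob {
    static Job configure() throws IOException {
        Configuration conf = new Configuration();
        // Assumption: every record is terminated by a line containing only "END".
        conf.set("textinputformat.record.delimiter", "\nEND\n");
        Job job = Job.getInstance(conf, "multi-line records");
        // Each map() call then receives one whole multi-line record as its value.
        job.setInputFormatClass(TextInputFormat.class);
        return job;
    }
}
```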

If the records are not delimited by a character sequence, you will need additional logic that, for example, counts a known number of fields (if the number of fields is known) and presents them as one record. This usually makes things more complex, error-prone, and slow, since there is still a lot of text processing involved.

Try to determine whether the records are delimited. It may help to post a short example of a few records.

+2

In your record reader, you need to define an algorithm by which you can:

  • Determine whether you are in the middle of a record
  • Scan past that record and read the next full record

This is similar to what TextInputFormat's LineReader already does: when the input split has a non-zero offset, the line reader scans forward from that offset to the first newline it finds, and then reads the record after that newline as the first record to emit. Related to this, if the end of the split is not EOF, the line reader continues past the end of the split to find the line terminator for the current record.
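The boundary rule described above can be tried out without Hadoop. The following is a minimal plain-Java simulation, not Hadoop's actual LineReader code: it treats the file as a string, uses a hypothetical single-character record terminator '#', and all class and method names are invented for the example. A reader skips the partial record at the start of its split (unless the split starts at offset 0) and reads past the end of its split to finish the last record it started.

```java
import java.util.ArrayList;
import java.util.List;

class SplitBoundaryDemo {

    /** Returns the records owned by the split [start, end) of data. */
    static List<String> readSplit(String data, char delim, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // We may be in the middle of a record: skip to just past the next
            // terminator. The reader of the previous split emits the skipped record.
            int d = data.indexOf(delim, start);
            if (d < 0) {
                return records; // no record starts in this split
            }
            pos = d + 1;
        }
        // Emit every record that starts at or before `end`, reading past `end`
        // if necessary to reach its terminator.
        while (pos <= end && pos < data.length()) {
            int d = data.indexOf(delim, pos);
            int recordEnd = (d < 0) ? data.length() : d;
            records.add(data.substring(pos, recordEnd));
            pos = recordEnd + 1;
        }
        return records;
    }

    /** Cuts data into fixed-size splits and reads each one independently. */
    static List<String> readAllSplits(String data, char delim, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length(); start += splitSize) {
            int end = Math.min(start + splitSize, data.length());
            all.addAll(readSplit(data, delim, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        // Three records of 3, 2 and 4 lines, each terminated by '#'.
        String data = "a1\na2\na3#b1\nb2#c1\nc2\nc3\nc4#";
        System.out.println(readAllSplits(data, '#', 7));
    }
}
```

With this rule, a record belongs to the split in which it starts, so every record is read exactly once regardless of where the split boundaries fall. A multi-character terminator needs extra care, because a split boundary can then land inside the terminator itself.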

+1

Source: https://habr.com/ru/post/1481878/

