How to read a record that spans several lines, and how to handle records broken across input splits

I have a log file that looks like this:

Begin ... 12-07-2008 02:00:05 ----> record1 incidentID: inc001 description: blah blah blah owner: abc status: resolved end .... 13-07-2008 02:00:05
Begin ... 12-07-2008 03:00:05 ----> record2 incidentID: inc002 description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah owner: abc status: resolved end .... 13-07-2008 03:00:05

I want to process it with MapReduce, extracting the incident identifier, the status, and the time spent on each incident.

How can I handle these records, given that they have variable length, and what happens if an input split falls in the middle of a record?
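For concreteness, this is the per-record extraction I have in mind, as a plain-Java sketch with no Hadoop dependencies. The `RecordParser` class name and the exact regular expression are my own illustration; the field names (`incidentID`, `status`) and the `dd-MM-yyyy HH:mm:ss` timestamps come from the sample log above.

```java
import java.text.SimpleDateFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecordParser {
    // Assumed record layout, based on the sample log above: the Begin/end
    // timestamps bracket the record, and the fields appear as "name: value".
    private static final Pattern RECORD = Pattern.compile(
        "Begin \\.\\.\\. (\\S+ \\S+) ----> \\S+ incidentID: (\\S+) .*?"
        + "status: (\\S+) end \\.\\.\\.\\. (\\S+ \\S+)");

    // Returns "incidentID<TAB>status<TAB>secondsSpent", or null on no match.
    public static String parse(String record) throws Exception {
        Matcher m = RECORD.matcher(record);
        if (!m.find()) return null;
        SimpleDateFormat fmt = new SimpleDateFormat("dd-MM-yyyy HH:mm:ss");
        long millis = fmt.parse(m.group(4)).getTime()
                    - fmt.parse(m.group(1)).getTime();
        return m.group(2) + "\t" + m.group(3) + "\t" + (millis / 1000) + "s";
    }

    public static void main(String[] args) throws Exception {
        String rec = "Begin ... 12-07-2008 02:00:05 ----> record1 "
            + "incidentID: inc001 description: blah blah blah owner: abc "
            + "status: resolved end .... 13-07-2008 02:00:05";
        System.out.println(parse(rec));  // inc001   resolved   86400s
    }
}
```

In a MapReduce job this logic would live in the mapper's `map()` method, assuming the record reader hands it one complete record at a time.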

2 answers

You will need to write your own InputFormat and RecordReader to ensure the file is split correctly around your record delimiter.

Basically, your record reader needs to seek to its split's starting byte offset, then scan forward (reading lines) until it:

  • Finds the Begin ... marker line
    • Reads the lines up to the next end .... line and emits the lines between Begin and end as the value of the next record
  • Scans past the end of its split, or hits EOF
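The scan-forward loop above can be sketched without any Hadoop dependencies; a real RecordReader would wrap the same loop around the split's stream and start/end offsets, and would keep reading past the split boundary until an open record is closed. The Begin/end markers come from the question; the `RecordScanner` class name and everything else here is illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class RecordScanner {
    // Collect everything between a "Begin ..." marker and the next "end ...."
    // marker into one logical record.
    public static List<String> scan(BufferedReader in) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("Begin ...")) {
                current = new StringBuilder(line);      // start of a record
            } else if (current != null) {
                current.append('\n').append(line);      // continuation line
            } else {
                continue;  // orphan line: owned by the previous split's reader
            }
            if (line.contains("end ....")) {            // record is complete
                records.add(current.toString());
                current = null;
            }
        }
        return records;
    }
}
```

The "skip until the first Begin" behaviour is what makes split boundaries safe: the reader whose split contains a record's Begin marker reads the whole record, even past its split's end, while the next reader discards the partial tail it starts in.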

This is similar to the algorithm Mahout's XmlInputFormat uses to treat multi-line XML as input; in fact, you could adapt that source code to your situation.

As mentioned in @irW's answer, NLineInputFormat is another option if your records have a fixed number of lines each, but it is really inefficient for large files, since its getSplits() method must open and read the entire file just to discover the line offsets.


In your example, each record has the same number of lines. If that is always the case, you can use NLineInputFormat; if the number of lines per record cannot be known in advance, it gets more complicated. (More on NLineInputFormat: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html )
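Roughly speaking, NLineInputFormat chops the file into splits of exactly N lines each. A plain-Java illustration of that grouping (not Hadoop's actual implementation; the `NLineGrouper` name is mine):

```java
import java.util.ArrayList;
import java.util.List;

public class NLineGrouper {
    // Group lines into chunks of exactly n lines (the last chunk may be
    // shorter). NLineInputFormat's getSplits() must read the whole file to
    // find every n-th line offset, which is the inefficiency noted above.
    public static List<List<String>> group(List<String> lines, int n) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            out.add(lines.subList(i, Math.min(i + n, lines.size())));
        }
        return out;
    }
}
```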


Source: https://habr.com/ru/post/1491992/
