Hadoop custom split for TextFile

I have a fairly large text file that I would like to convert to a SequenceFile. Unfortunately, the file consists of Python code in which a logical line can run over several physical lines. For example: print "Blah Blah \ ... blah blah"
Each logical line ends with NEWLINE. Can anyone clarify how I could generate key/value pairs in MapReduce, where each value is an entire logical line?

+2
3 answers

You need to write your own variant of TextInputFormat. In it, you create a new RecordReader that skips lines until it sees the beginning of a logical line.
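A minimal sketch of the line-joining logic such a RecordReader could wrap, written as plain Java so it runs without Hadoop. The class name LogicalLineReader and the method nextLogicalLine are hypothetical, and the sketch assumes that a trailing backslash is the only continuation marker (bracketed expressions and triple-quoted strings are ignored). In a real RecordReader this loop would sit inside nextKeyValue(), and the split-boundary skipping mentioned above would still have to be added.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Pull-style reader: each call returns one logical line, or null at end of input. */
final class LogicalLineReader {
    private final BufferedReader in;

    LogicalLineReader(BufferedReader in) {
        this.in = in;
    }

    /** Joins physical lines for as long as the current one ends with a backslash. */
    String nextLogicalLine() {
        try {
            String line = in.readLine();
            if (line == null) {
                return null; // no more input
            }
            StringBuilder logical = new StringBuilder();
            while (line != null && line.endsWith("\\")) {
                // Drop the trailing backslash and continue with the next physical line.
                logical.append(line, 0, line.length() - 1);
                line = in.readLine();
            }
            if (line != null) {
                logical.append(line); // last physical line of this logical line
            }
            return logical.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling nextLogicalLine() in a loop until it returns null yields one value per logical line, which is the shape a RecordReader hands to the mapper.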

+1

I can't find the earlier question on this, but you just need to iterate over the lines in a simple MapReduce job and accumulate them in a StringBuilder. Flush the StringBuilder to the context whenever a new record begins. The trick is to declare the StringBuilder as a field of your mapper class, not as a local variable.

Here it is: Processing paragraphs in text files as single entries using Hadoop
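The accumulate-and-flush idea can be sketched in plain Java, without Hadoop on the classpath. The class LogicalLineAccumulator and its emit callback are hypothetical stand-ins for a Mapper and its Context; map() plays the role of Mapper.map() and cleanup() the role of Mapper.cleanup(). As an assumption, a trailing backslash is treated as the only continuation marker.

```java
import java.util.function.Consumer;

/** Push-style accumulator: fed one physical line at a time, emits whole logical lines. */
final class LogicalLineAccumulator {
    // A field, not a local variable: it has to survive between map() calls.
    private final StringBuilder pending = new StringBuilder();
    private final Consumer<String> emit; // stand-in for context.write(...)

    LogicalLineAccumulator(Consumer<String> emit) {
        this.emit = emit;
    }

    /** Called once per physical line, in input order. */
    void map(String physicalLine) {
        if (physicalLine.endsWith("\\")) {
            // Continuation: keep the text without the backslash and wait for more.
            pending.append(physicalLine, 0, physicalLine.length() - 1);
        } else {
            pending.append(physicalLine);
            flush();
        }
    }

    /** Called once after the last line, so a dangling continuation is not lost. */
    void cleanup() {
        if (pending.length() > 0) {
            flush();
        }
    }

    private void flush() {
        emit.accept(pending.toString());
        pending.setLength(0); // reuse the same buffer for the next record
    }
}
```

This only works while the lines of one logical line reach the same mapper instance in order, so a logical line that straddles an input split would still be cut in two.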

+4

Preprocess the input file to remove the newlines. What is your goal in creating a SequenceFile?

0

Source: https://habr.com/ru/post/892951/

