Hadoop custom split for TextFile

Question

I have a fairly large text file that I would like to convert to a SequenceFile. Unfortunately, the file consists of Python code with logical lines running over several physical lines. For example: print "Blah Blah \ ... blah blah"
Each logical line ends with a NEWLINE. Can anyone clarify how I could generate key/value pairs in MapReduce, where each value is an entire logical line?

+2

hadoop

dvk Jun 13 '11 at 6:34

3 answers

I can't find the earlier question on this, but you just need to iterate over the lines in a simple MapReduce job and accumulate them in a StringBuilder. Flush the StringBuilder to the context whenever a new record begins. The trick is to declare the StringBuilder as a field of your mapper class, not as a local variable.

Here it is: Processing paragraphs in text files as single entries using Hadoop
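
A minimal sketch of that buffering idea, written in plain Java so it runs without Hadoop on the classpath. The class name LogicalLineMapper, the emitted list (standing in for context.write) and the rule that a trailing backslash marks a continuation are assumptions for illustration; Python's implicit continuation inside brackets is not handled, and a logical line that straddles two input splits would still be cut in half with this approach.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Hadoop Mapper: map() is called once per
// physical line, cleanup() once after the last line of the split.
class LogicalLineMapper {
    // A field, not a local variable, so the buffer survives between map() calls.
    private final StringBuilder pending = new StringBuilder();
    // Stand-in for context.write(...) in a real job.
    private final List<String> emitted = new ArrayList<>();

    void map(String physicalLine) {
        if (physicalLine.endsWith("\\")) {
            // Continuation: drop the trailing backslash and keep buffering.
            pending.append(physicalLine, 0, physicalLine.length() - 1);
        } else {
            // End of the logical line: complete the record and emit it.
            pending.append(physicalLine);
            flush();
        }
    }

    void cleanup() {
        // Emit whatever is left if the input ended on a continuation line.
        if (pending.length() > 0) {
            flush();
        }
    }

    private void flush() {
        emitted.add(pending.toString());
        pending.setLength(0);
    }

    List<String> emitted() {
        return emitted;
    }
}
```

In a real job the same field-plus-flush pattern would live in a Mapper subclass, with the flush writing to the context instead of a list.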

+4

Thomas Jungblut Jun 15 '11 at 15:41

Preprocess the input file to remove the newlines inside logical lines. What is your goal in creating a SequenceFile?

0

David Medinets Jun 15 '11 at 15:09

Niels Basjes · Accepted Answer · 2011-06-14T08:53:10+0000

You should create your own variant of TextInputFormat. In it you create a new RecordReader that, at the start of its split, skips physical lines until it sees the beginning of a logical line, and then returns each complete logical line as one record.
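
The reader logic can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the Hadoop class itself: LogicalLineReader is a hypothetical name, splits are modelled as ranges of physical line indices instead of byte offsets, the key is the index of the first physical line, and a trailing backslash is taken as the only continuation marker. The method names nextKeyValue, getCurrentKey and getCurrentValue mirror those of Hadoop's RecordReader.

```java
import java.util.List;

// Hypothetical sketch of the record-reading logic for one split.
class LogicalLineReader {
    private final List<String> lines; // all physical lines of the file
    private final int end;            // first physical line past this split
    private int pos;                  // next physical line to read
    private int currentKey = -1;
    private String currentValue;

    LogicalLineReader(List<String> lines, int start, int end) {
        this.lines = lines;
        this.end = end;
        this.pos = start;
        // Skip lines that continue a logical line begun in the previous split;
        // the reader of that split is responsible for them.
        while (pos > 0 && pos < lines.size() && continues(lines.get(pos - 1))) {
            pos++;
        }
    }

    private static boolean continues(String line) {
        return line.endsWith("\\");
    }

    boolean nextKeyValue() {
        if (pos >= end || pos >= lines.size()) {
            return false;
        }
        currentKey = pos;
        StringBuilder sb = new StringBuilder();
        // Read past the end of the split if needed to finish the logical line.
        while (pos < lines.size()) {
            String line = lines.get(pos++);
            if (continues(line)) {
                sb.append(line, 0, line.length() - 1);
            } else {
                sb.append(line);
                break;
            }
        }
        currentValue = sb.toString();
        return true;
    }

    int getCurrentKey() {
        return currentKey;
    }

    String getCurrentValue() {
        return currentValue;
    }
}
```

With this split-boundary rule each logical line is read by exactly one reader: the one whose split contains its first physical line.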
