Processing paragraphs in text files as separate entries using Hadoop

Question

To simplify my problem a bit: I have a set of text files with "records" that are delimited by double newline characters, like this:

'multi-line text'
'empty line'
'multi-line text'
'empty line'

etc.

I need to treat each multi-line unit as a separate record and then run MapReduce over them.

However, I know that with the default wordcount setup in the Hadoop code template, the value passed to the following function is just a single line, and there is no guarantee that it is contiguous with the previous input line.

 public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException;

And I need the input value to actually be a single unit of multi-line text, delimited by the double newline.

The RecordReader class and the getSplits method have come up in some searches, but I found no simple code examples that I could wrap my head around.

An alternative solution is to simply replace all newlines within each multi-line unit with space characters and be done with it. I would prefer not to do this, because there is quite a lot of text and it would take a lot of runtime. I would also need to modify a lot of code if I did this, so handling it within Hadoop would be most attractive to me.

java mapreduce hadoop

Jasonmond Apr 29 '11 at 4:44

2 answers

What is the problem? Just append the incoming lines to a StringBuilder and flush it when you reach a new record.
When you use text files, they will not be split. In these cases, it uses FileInputFormat, which parallelizes only over the number of available files.

Thomas Jungblut Apr 29 '11 at 6:30

Pranab · Accepted Answer · 2011-06-16T02:05:21+0000

If your files are small, they will not be split. Essentially, each file is a single split assigned to one mapper instance. In this case, I agree with Thomas. You can build your logical record in your mapper class by concatenating lines. You can find the record boundary by looking for the empty string that arrives as the value in your mapper.
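To make the concatenation idea concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class name ParagraphAccumulator and its methods are hypothetical, not part of any Hadoop API. The assumption is that accept() plays the role of map() being called once per line, and flush() would be called when the mapper is closed, so that the last record is not lost.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups lines into paragraph records, one line at a time. */
final class ParagraphAccumulator {
    private final StringBuilder current = new StringBuilder();
    private final List<String> records = new ArrayList<>();

    /** Call once per input line, in order (the role map() would play). */
    void accept(String line) {
        if (line.trim().isEmpty()) {
            flush(); // an empty line marks a record boundary
        } else {
            if (current.length() > 0) {
                current.append('\n'); // keep the original line breaks inside a record
            }
            current.append(line);
        }
    }

    /** Call after the last line so that the final record is emitted too. */
    void flush() {
        if (current.length() > 0) {
            records.add(current.toString());
            current.setLength(0);
        }
    }

    List<String> records() {
        return records;
    }

    public static void main(String[] args) {
        ParagraphAccumulator acc = new ParagraphAccumulator();
        String[] lines = {"first line", "second line", "", "third line"};
        for (String line : lines) {
            acc.accept(line);
        }
        acc.flush();
        System.out.println(acc.records().size()); // prints 2
    }
}
```

In a real mapper you would emit each record to the output collector instead of storing it in a list. This only works when all lines of a record reach the same mapper in order, which is why it depends on the file not being split.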

However, if the files are large and get split, then I see no other option than to implement your own text input format. You can clone the existing Hadoop Java classes LineRecordReader and LineReader. You should make a small change to your version of the LineReader class so that the record delimiter is two newlines, not one. Once this is done, your mapper will receive multiple lines as a single input value.
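The core of that change, reading up to a two-newline delimiter instead of one, can be sketched in plain Java as follows. This is not the Hadoop LineReader code. ParagraphReader, readRecord and readAll are hypothetical names, and the sketch assumes Unix line endings and leaves out the split-boundary handling that a real LineRecordReader clone would need.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a modified LineReader: one paragraph per call. */
final class ParagraphReader {
    private final BufferedReader in;

    ParagraphReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    /** Returns the text up to the next "\n\n" (delimiter removed), or null at end of input. */
    String readRecord() throws IOException {
        StringBuilder record = new StringBuilder();
        boolean sawAny = false;
        int c;
        while ((c = in.read()) != -1) {
            sawAny = true;
            record.append((char) c);
            int n = record.length();
            // Stop as soon as the record ends with exactly the two-newline delimiter.
            if (n >= 2 && record.charAt(n - 1) == '\n' && record.charAt(n - 2) == '\n') {
                record.setLength(n - 2);
                return record.toString();
            }
        }
        // End of input: return what is left, if anything was read at all.
        return sawAny ? record.toString() : null;
    }

    /** Convenience wrapper for trying the reader out on an in-memory string. */
    static List<String> readAll(String text) {
        List<String> out = new ArrayList<>();
        try {
            ParagraphReader reader = new ParagraphReader(new StringReader(text));
            String rec;
            while ((rec = reader.readRecord()) != null) {
                out.add(rec);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String rec : readAll("line 1\nline 2\n\nline 3\nline 4")) {
            System.out.println("[" + rec + "]");
        }
    }
}
```

Before cloning the classes, it may be worth checking whether your Hadoop version already lets TextInputFormat take a custom delimiter through the textinputformat.record.delimiter configuration property. I believe newer releases support this, but verify it against the version you run.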
