To simplify my problem a bit: I have a set of text files containing "records" that are delimited by double newline characters (i.e. blank lines), like so:

multi-line text
(empty line)
multi-line text
(empty line)
etc.
I need to process each multi-line unit as a separate record in MapReduce.
However, with the default setup used by the WordCount example in the Hadoop code templates, the value passed into the map function is a single line, and there is no guarantee that it is contiguous with (processed by the same mapper as) the previous input line.
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
And I need the input value to actually be one complete multi-line unit of text, i.e. everything up to the double-newline separator.
RecordReader and getSplits have come up in some of my searches, but I haven't found simple code examples that I could get my head around.
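To make it concrete, the core logic such a custom RecordReader would need can be sketched in plain Java, independent of the Hadoop API: accumulate lines until a blank line is seen, then emit the accumulated block as one record. The class and method names here are illustrative, not part of Hadoop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of blank-line-delimited record splitting, the logic a
 * custom RecordReader would wrap. Names are illustrative only.
 */
public class ParagraphSplitter {
    public static List<String> readRecords(Reader input) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                // Blank line: close out the current record, if any.
                if (current.length() > 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        // Flush a trailing record with no final blank line.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String sample = "line 1\nline 2\n\nline 3\n\n\nline 4";
        for (String record : readRecords(new StringReader(sample))) {
            System.out.println("[" + record.replace("\n", "\\n") + "]");
        }
    }
}
```

In a real Hadoop RecordReader you would additionally have to handle split boundaries (a record may straddle two splits), which is where getSplits comes in.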
An alternative solution would be to simply preprocess the files, replacing all newlines inside each multi-line record with space characters, and be done with it. I would prefer not to do this: there is quite a lot of text, so the extra pass costs real runtime, and I would also have to modify a lot of code, so handling it within Hadoop itself would be the most attractive option for me.
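Depending on the Hadoop version, there may be a middle ground that avoids both the preprocessing pass and a hand-written InputFormat: newer TextInputFormat implementations read their record delimiter from the `textinputformat.record.delimiter` configuration property. A minimal driver fragment, assuming a Hadoop release whose TextInputFormat honors this property:

```java
// Assumption: the TextInputFormat in your Hadoop release honors
// textinputformat.record.delimiter (newer mapreduce API versions do;
// check your distribution). Then each map() value is a whole paragraph.
Configuration conf = new Configuration();
conf.set("textinputformat.record.delimiter", "\n\n");
Job job = Job.getInstance(conf, "paragraph-records");
```

If the property is supported, each call to map() then receives one double-newline-delimited block as its value, with no other code changes.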