Hadoop 2.4.0 streaming generic parser option using TAB as a separator

Question

I know that tab is the default field separator for these properties:

 stream.map.output.field.separator
 stream.reduce.input.field.separator
 stream.reduce.output.field.separator
 mapreduce.textoutputformat.separator

but if I try to write the generic parser option:

 stream.map.output.field.separator=\t (or) stream.map.output.field.separator="\t"

to check how Hadoop parses whitespace characters such as "\t", "\n" and "\f" when they are used as separators, I notice that Hadoop reads it as the literal two-character string \t, not as an actual tab character. I checked this by printing every line the reducer (Python) reads, using:

 sys.stdout.write(str(line))

My mapper emits key/value pairs like: key value1 value2

using the command print (key,value1,value2,sep='\t',end='\n') .

Therefore, I expected my reducer to read each line as: key value1 value2 too, but instead sys.stdout.write(str(line)) printed:

key value1 value2 \\with trailing space
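
A trailing tab or space is hard to see in plain output. As a minimal sketch, writing repr(line) instead of str(line) shows tabs and newlines as visible escapes. The helper name dump_lines is hypothetical and only for illustration; it is not part of Hadoop.

```python
import io
import sys


def dump_lines(lines, out=sys.stdout):
    """Write each input line in repr() form so tabs and newlines show as escapes."""
    for line in lines:
        out.write(repr(line) + "\n")


# In a streaming reducer this could be called as: dump_lines(sys.stdin)
# Local demonstration with a line that ends in a tab before the newline:
buf = io.StringIO()
dump_lines(["key\tvalue1\tvalue2\t\n"], buf)
print(buf.getvalue(), end="")  # 'key\tvalue1\tvalue2\t\n'
```

With this form, a trailing tab shows up as \t just before the \n and cannot be mistaken for a space.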

From "Hadoop streaming - remove the trailing tab from the output of the reducer", I understood that the trailing space appears because mapreduce.textoutputformat.separator is not set and is left at its default.

So this confirmed my assumption that Hadoop treated my whole map output line:

key value1 value2

as the key, with the value being an empty Text object, because it read the separator from stream.map.output.field.separator=\t as the literal characters "\t" instead of the tab character itself.

Please help me understand this behavior, and how I can use \t as a separator if I want to.

python utf-8 mapreduce hadoop hadoop-streaming

annunarcist May 27, '15 at 18:57

1 answer

Ramzy · Answer 1 · 2015-06-04T18:53:40+0000

You may be running into this. The Hadoop streaming documentation gives this example: "-D stream.map.output.field.separator=." specifies "." as the field separator for the map outputs, and the prefix up to the fourth "." in a line will be the key, and the rest of the line (excluding the fourth ".") will be the value. If a line has fewer than four "."s, then the whole line will be the key and the value will be an empty Text object (like the one created by new Text("")).

This shows how the separator is used, and also how many occurrences of the separator are counted when splitting a line into the map key and value. There are also settings related to partitioning, which determine which reducer each record is sent to. Since you want to change the separator, I think you need to check those settings for the partitioner and the reducer as well.
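
The splitting rule described above can be imitated in a few lines of Python. This is only a sketch of the documented behavior, not Hadoop's actual code: the function name split_key_value and its parameter num_key_fields are made up for illustration, and joining key and value with a tab on the reducer side is an assumption based on the default separators mentioned in the question.

```python
def split_key_value(line, separator, num_key_fields=1):
    """Split a map-output line into (key, value).

    The key is the prefix up to the num_key_fields-th separator and the
    value is the rest. If the line has fewer separators than that, the
    whole line is the key and the value is empty.
    """
    start = 0
    pos = -1
    for _ in range(num_key_fields):
        pos = line.find(separator, start)
        if pos == -1:
            return line, ""
        start = pos + len(separator)
    return line[:pos], line[start:]


# The example from the quoted documentation: "." as separator, four key fields.
print(split_key_value("a.b.c.d.e.f", ".", 4))  # ('a.b.c.d', 'e.f')
print(split_key_value("a.b.c", ".", 4))        # ('a.b.c', '')

mapper_line = "key\tvalue1\tvalue2"

# Separator is a real tab: the first field becomes the key.
print(split_key_value(mapper_line, "\t"))      # ('key', 'value1\tvalue2')

# Separator is the two characters backslash + t: nothing matches, so the
# whole line is the key and the value is empty.
key, value = split_key_value(mapper_line, "\\t")
print(repr(key + "\t" + value))                # 'key\tvalue1\tvalue2\t'
```

The last line shows where a trailing tab could come from: if the separator is taken as the literal text \t, the value is empty, and writing key, tab, value still emits the tab after the key.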
