I know that a tab is the default field separator for these Hadoop Streaming properties:
stream.map.output.field.separator
stream.reduce.input.field.separator
stream.reduce.output.field.separator
mapreduce.output.textoutputformat.separator
but when I try to pass the separator as a generic option:

-D stream.map.output.field.separator=\t (or) -D stream.map.output.field.separator="\t"
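For context, the full command looks roughly like this (the jar path, input/output paths and script names are placeholders, not my actual setup):

```sh
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D stream.map.output.field.separator="\t" \
    -input /user/me/input \
    -output /user/me/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py
```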
to check how Hadoop parses whitespace characters such as \t, \n and \f when they are used as delimiters, I noticed that Hadoop reads the separator as the two literal characters \ and t, not as a tab. I verified this by printing every line my (Python) reducer reads, using:
sys.stdout.write(str(line))
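The reducer-side check looks roughly like this (a minimal sketch; wrapping each raw line in repr() is what makes the invisible trailing character show up):

```python
import sys

def echo_raw(lines, out=sys.stdout):
    # Write each input line back out through repr() so invisible
    # characters (tabs, newlines, trailing separators) become visible.
    shown = [repr(line) for line in lines]
    for s in shown:
        out.write(s + "\n")
    return shown

# In the real reducer the input iterable would be sys.stdin; a sample
# line containing real tabs stands in for it here.
result = echo_raw(["key\tvalue1\tvalue2\t\n"])
```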
My mapper emits tab-separated key/value pairs like key value1 value2, using print(key, value1, value2, sep='\t', end='\n').
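That print call really does join the fields with tab characters; a quick standalone check (with hypothetical field values) confirms it:

```python
import io

# print() with sep='\t' joins the fields with a real tab character
# (0x09), which is what the streaming framework expects between
# the key and value fields; repr() makes the tabs visible.
buf = io.StringIO()
print("key", "value1", "value2", sep="\t", end="\n", file=buf)
line = buf.getvalue()
print(repr(line))  # -> 'key\tvalue1\tvalue2\n'
```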
So I expected my reducer to read each line as key value1 value2 too, but instead sys.stdout.write(str(line)) printed:

key value1 value2    (with a trailing tab)
From Hadoop streaming - remove the trailing tab from the output of the reducer, I realized that the trailing tab appears because mapreduce.output.textoutputformat.separator is not set and keeps its default value, a tab.
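The trailing character can be reproduced with a simplified Python model of what TextOutputFormat does (this helper is my own illustration, not Hadoop's actual Java code):

```python
def text_output_line(key, value, separator="\t"):
    # Simplified model of TextOutputFormat's record writer: it emits
    # key, then the separator, then the value.  With an empty (but
    # non-null) value, the separator is still written, which is what
    # leaves the trailing tab on every reducer output line.
    return key + separator + value

# The whole map output line ended up as the key; the value is empty.
out = text_output_line("key\tvalue1\tvalue2", "")
print(repr(out))  # -> 'key\tvalue1\tvalue2\t'
```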
So this confirmed my suspicion that Hadoop treated my entire map output line:
key value1 value2
as the key, and the value as an empty Text object, since it reads the separator from stream.map.output.field.separator=\t as the two literal characters "\t" instead of an actual tab.
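The effect of the mis-parsed separator can be illustrated with a plain-Python model of the field split (the partition-based helper is mine, not Hadoop's code):

```python
def split_key_value(line, separator):
    # Model of the streaming field split: break the line at the first
    # occurrence of the separator into (key, value).
    key, _, value = line.partition(separator)
    return key, value

line = "key\tvalue1\tvalue2"   # fields joined by real tabs

# Separator taken as the two literal characters backslash + t:
# it never occurs in the line, so the key becomes the whole line
# and the value is empty.
k_bad, v_bad = split_key_value(line, "\\t")

# Separator as an actual tab: the line splits at the first tab.
k_ok, v_ok = split_key_value(line, "\t")
```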
Please help me understand this behavior, and how I can use a real tab as the delimiter if I want.