We found that the problem was that the Java processes running the tasks were not started with -Dfile.encoding=UTF-8. This flag must be added to "mapreduce.map.java.opts", and likewise to "mapreduce.reduce.java.opts".
You can do this in XML configuration files as well as in Java, for example:
config.set("mapreduce.map.java.opts", "-Xmx1843M -Dfile.encoding=UTF-8");
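If the job already sets other JVM options, hard-coding the whole string can overwrite them or leave a conflicting file.encoding value in place. The following is only a minimal sketch of one way to merge the flag into an existing options string; the class name JavaOptsHelper and the method withUtf8Encoding are hypothetical and not part of Hadoop, and the code uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop): builds a JVM options string
// that always ends with -Dfile.encoding=UTF-8.
public class JavaOptsHelper {
    static final String ENCODING_FLAG = "-Dfile.encoding=UTF-8";

    // Keeps every existing option except an earlier -Dfile.encoding=...,
    // then appends the UTF-8 flag.
    public static String withUtf8Encoding(String existingOpts) {
        List<String> kept = new ArrayList<>();
        if (existingOpts != null) {
            for (String token : existingOpts.trim().split("\\s+")) {
                if (!token.isEmpty() && !token.startsWith("-Dfile.encoding=")) {
                    kept.add(token);
                }
            }
        }
        kept.add(ENCODING_FLAG);
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Example with the heap setting used above.
        System.out.println(withUtf8Encoding("-Xmx1843M"));
    }
}
```

In a job driver, the result could then be passed to config.set for both "mapreduce.map.java.opts" and "mapreduce.reduce.java.opts", assuming config is the job's Hadoop Configuration object as in the line above.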
For more information about the settings, see http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html