Specify UTF-8 Encoding for Hadoop Jobs

In the Hadoop job I'm writing, I need UTF-8 encoded strings in my mappers / reducers. How do I specify UTF-8 as the system encoding? When I run the code locally, I can just do java -Dfile.encoding="UTF-8", but when I tried hadoop jar myjar.jar -Dfile.encoding="UTF-8", it didn't work.

2 answers

By default, Hadoop uses UTF-8 encoding, so you should not need to pass such a parameter.
I assume you are using Text for your keys / values. According to the javadoc: "This class stores text using standard UTF8 encoding."
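For illustration, here is a minimal mapper sketch (the class name and pass-through logic are assumptions, not from the question): Text stores its payload as UTF-8 bytes, so toString() and the Text(String) constructor round-trip non-ASCII characters independently of the JVM's default charset.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PassThroughMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Text.toString() always decodes the stored bytes as UTF-8,
        // and new Text(String) always encodes back to UTF-8.
        String line = value.toString();
        context.write(new Text(line), new Text(""));
    }
}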


We found that the problem was that the forked Java child processes did not have -Dfile.encoding=UTF-8. We had to add it to "mapreduce.map.java.opts", and the same goes for "mapreduce.reduce.java.opts".

You can do this in XML configuration files as well as in Java, for example:

config.set("mapreduce.map.java.opts", "-Xmx1843M -Dfile.encoding=UTF-8");
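As a fuller sketch, a driver might set the option for both map and reduce child JVMs before submitting the job (the class name and job name below are assumptions, and the -Xmx value is simply carried over from the line above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class Utf8JobDriver {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Pass -Dfile.encoding=UTF-8 to the child JVMs of both task types.
        config.set("mapreduce.map.java.opts", "-Xmx1843M -Dfile.encoding=UTF-8");
        config.set("mapreduce.reduce.java.opts", "-Xmx1843M -Dfile.encoding=UTF-8");
        Job job = Job.getInstance(config, "utf8-job");
        // ... set mapper/reducer classes and input/output paths here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}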

For more information about these settings, see http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html.

