Specify UTF-8 Encoding for Hadoop Jobs

In the Hadoop job I'm writing, I need UTF-8 encoded strings in my mappers / reducers. How do I specify UTF-8 as the system encoding? When I run the code locally, I can just do java -Dfile.encoding="UTF-8", but when I tried hadoop jar myjar.jar -Dfile.encoding="UTF-8", it didn't work.

2 answers

By default, Hadoop uses UTF-8 encoding, so you should not need to pass such a parameter.
I assume you are using Text for your keys / values. According to the javadoc: "This class stores text using standard UTF8 encoding."
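For illustration, here is a minimal mapper sketch (the class name and pass-through logic are assumptions, not from the question): Text stores its payload as UTF-8 bytes, so toString() and the Text(String) constructor round-trip non-ASCII characters independently of the JVM's default charset.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PassThroughMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Text.toString() always decodes the stored bytes as UTF-8,
        // and new Text(String) always encodes back to UTF-8.
        String line = value.toString();
        context.write(new Text(line), new Text(""));
    }
}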


We found that the problem was that the forked Java child processes did not have -Dfile.encoding=UTF-8. We had to add it to "mapreduce.map.java.opts", and the same goes for "mapreduce.reduce.java.opts".

You can do this in XML configuration files as well as in Java, for example:

config.set("mapreduce.map.java.opts", "-Xmx1843M -Dfile.encoding=UTF-8");
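As a fuller sketch, a driver might set the option for both map and reduce child JVMs before submitting the job (the class name and job name below are assumptions, and the -Xmx value is simply carried over from the line above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class Utf8JobDriver {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Pass -Dfile.encoding=UTF-8 to the child JVMs of both task types.
        config.set("mapreduce.map.java.opts", "-Xmx1843M -Dfile.encoding=UTF-8");
        config.set("mapreduce.reduce.java.opts", "-Xmx1843M -Dfile.encoding=UTF-8");
        Job job = Job.getInstance(config, "utf8-job");
        // ... set mapper/reducer classes and input/output paths here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}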

For more information about these settings, see http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html.

