Hadoop: Zipping files using MapReduce streaming, can I control the order of output parts?

I am trying to compress files in HDFS using BZip2. Doing this with MapReduce streaming seems like a good approach, according to the answer to the following post:

Hadoop: compress file in HDFS?

Relevant sample code from that answer:

    hadoop jar contrib/streaming/hadoop-streaming-1.0.3.jar \
            -Dmapred.reduce.tasks=0 \
            -Dmapred.output.compress=true \
            -Dmapred.compress.map.output=true \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
            -input filename \
            -output /filename \
            -mapper /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
    hadoop fs -cat /path/part* | hadoop fs -put - /path/compressed.gz

In practice, I did the equivalent of the above, but with the BZip2 codec instead of Gzip, and I used the Java Hadoop streaming API together with ToolRunner to submit the streaming job to the cluster from a machine outside the cluster.
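For reference, a sketch of the equivalent command line with the BZip2 codec swapped in (the input/output paths are the placeholders from the sample above); in the Java variant, essentially the same argument list is handed to ToolRunner. I dropped mapred.compress.map.output here, since with zero reducers there is no intermediate map output to compress:

    hadoop jar contrib/streaming/hadoop-streaming-1.0.3.jar \
            -Dmapred.reduce.tasks=0 \
            -Dmapred.output.compress=true \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
            -input filename \
            -output /filename \
            -mapper /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
    hadoop fs -cat /filename/part* | hadoop fs -put - /filename.bz2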

Important features of this approach are:

  • Since this is a streaming job with a text input format, only the value (the contents of each line of the file), and not the key (the byte offset of the line), is emitted by the map task.

  • . , , , .

  • -.

The last command of the sample then concatenates all the part-* files into one compressed file. For that concatenation to reproduce the original file, the parts must come out in the order of the input splits they were produced from.

So this is what the whole approach hinges on: the order of the output parts.

My question is:

Can I control (or at least rely on) the order of the output parts of a MapReduce streaming job?

The output parts are named part-00000, part-00001, and so on, so a part* glob expands them in lexicographic order. If the input fits into a single split (that is, there is only 1 mapper), the question does not even arise. With several mappers, though, everything depends on whether the numbering of the parts matches the order of the input splits.

As far as I can tell, the number in part-* comes from the ID of the map task within the MapReduce job. What I could not find anywhere, however, is a statement that MapReduce guarantees the task IDs to follow the order of the input splits, rather than this being an implementation detail that merely happens to hold.

In short: is the order of the part-* files guaranteed to match the order of the input, and if not, can I control it some other way?
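For what it's worth, here is an empirical check (the paths are the placeholders from the sample) that I can run to at least observe the behaviour on a given input, even though it proves nothing about guarantees:

    # Round-trip: decompress the concatenated parts and diff with the input.
    hadoop fs -cat /filename/part* | bzip2 -dc > roundtrip.txt
    hadoop fs -cat filename > original.txt
    diff original.txt roundtrip.txt && echo "order preserved for this run"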


Source: https://habr.com/ru/post/1611597/

