How to reduce the number of output files in Apache Hive

Does anyone know a tool that can merge Apache Hadoop output files into fewer files, or ideally a single file? Currently I download all the files to my local machine and merge them into one file there. Is there an API or tool that does the same thing? Thanks in advance.

2 answers

Limiting the number of output files means limiting the number of reducers. You can do this by setting the mapred.reduce.tasks property from the Hive shell. Example:

 hive> set mapred.reduce.tasks = 5; 
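As a sketch, here is how the setting would be used in a full session; the table and output directory names are hypothetical:

 -- Hypothetical example: cap the query at 5 reducers,
 -- so it produces at most 5 output files.
 set mapred.reduce.tasks = 5;

 INSERT OVERWRITE DIRECTORY '/user/hive/output/sales_summary'
 SELECT region, SUM(amount)
 FROM sales            -- hypothetical table
 GROUP BY region;

Note that this only helps for queries that actually have a reduce stage; a map-only query still produces one file per mapper.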

But this may affect the performance of your query. Alternatively, you can use the getmerge command from the HDFS shell once your query finishes. This command takes a source directory and a destination file as input and concatenates the files in src into a single local destination file.

Usage:

 bin/hadoop fs -getmerge <src> <localdst> 
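For example, to collapse a Hive output directory into one local file and push the result back to HDFS (both paths are hypothetical):

 # Merge all part-* files from a (hypothetical) Hive output directory
 # into a single local file.
 bin/hadoop fs -getmerge /user/hive/output/sales_summary ./sales_summary.txt

 # Optionally copy the merged file back into HDFS.
 bin/hadoop fs -put ./sales_summary.txt /user/hive/merged/sales_summary.txt

Keep in mind that getmerge simply concatenates the part files, so it is only appropriate for plain text output, not for binary formats such as ORC or Parquet.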

HTH


See https://community.cloudera.com/t5/Support-Questions/Hive-Multiple-Small-Files/td-p/204038

 set hive.merge.mapfiles=true;       -- Merge small files at the end of a map-only job.
 set hive.merge.mapredfiles=true;    -- Merge small files at the end of a map-reduce job.
 set hive.merge.size.per.task=???;   -- Size (bytes) of merged files at the end of the job.
 set hive.merge.smallfiles.avgsize=??? -- File size (bytes) threshold.
 -- When the average output file size of a job is less than this number, Hive will
 -- start an additional map-reduce job to merge the output files into bigger files.
 -- This is only done for map-only jobs if hive.merge.mapfiles is true, and for
 -- map-reduce jobs if hive.merge.mapredfiles is true.
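As a concrete sketch, a session might enable merging with sizes like these; the byte values are illustrative examples, not recommendations:

 set hive.merge.mapfiles=true;
 set hive.merge.mapredfiles=true;
 set hive.merge.size.per.task=268435456;      -- target ~256 MB per merged file (example value)
 set hive.merge.smallfiles.avgsize=16777216;  -- merge when avg output file < 16 MB (example value)

With these settings, Hive launches an extra merge job after your query only when the average output file falls below the threshold, so well-sized outputs pay no extra cost.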

Source: https://habr.com/ru/post/1480366/
