How to merge Spark results files without repartition and copyMerge?

Question

I am using the following code:

csv.saveAsTextFile(pathToResults, classOf[GzipCodec])

The pathToResults directory contains many files, such as part-0000, part-0001, and so on. I can use FileUtil.copyMerge(), but it is very slow: it downloads all the files to the driver program and then uploads them back to Hadoop. Even so, FileUtil.copyMerge() is faster than:

 csv.repartition(1).saveAsTextFile(pathToResults, classOf[GzipCodec])

How can I merge the Spark results files without repartition and FileUtil.copyMerge()?

scala hadoop apache-spark

Leonard Mar 13 '15 at 4:57

3 answers

I had exactly the same question and ended up writing PySpark code (using Hadoop API calls) that implements copyMerge:

https://github.com/Tagar/stuff/blob/master/copyMerge.py

Unfortunately, copyMerge as a standalone Hadoop API call is deprecated and will be removed in Hadoop 3.0. This implementation therefore does not depend on Hadoop's copyMerge (it reimplements it).
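For readers working in Scala, as in the question, here is a minimal sketch of the same idea. It is not the code from the linked repository. It assumes hadoop-client is on the classpath, and the names MergeParts and mergeParts are made up for this example. It concatenates the part-* files byte by byte through the Hadoop FileSystem API, so, like copyMerge, it still streams all the data through the machine that runs it.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical helper, not part of Hadoop or Spark.
object MergeParts {
  /** Concatenates every part-* file under srcDir into dstFile, in file-name order. */
  def mergeParts(fs: FileSystem, srcDir: Path, dstFile: Path, conf: Configuration): Unit = {
    // Skip _SUCCESS, checksum files, and subdirectories; keep only part files.
    val parts = fs.listStatus(srcDir)
      .filter(s => s.isFile && s.getPath.getName.startsWith("part-"))
      .sortBy(_.getPath.getName)

    val out = fs.create(dstFile) // overwrites dstFile if it already exists
    try {
      parts.foreach { status =>
        val in = fs.open(status.getPath)
        // close = false: the streams are closed explicitly below
        try IOUtils.copyBytes(in, out, conf, false)
        finally in.close()
      }
    } finally out.close()
  }
}
```

From a Spark driver this could be called as MergeParts.mergeParts(FileSystem.get(sc.hadoopConfiguration), new Path(pathToResults), new Path(pathToResults + ".gz"), sc.hadoopConfiguration), assuming sc is the SparkContext. With GzipCodec output the merged file is a sequence of gzip members, which the gzip format permits, but it is worth checking that whatever reads the file handles concatenated members.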

Tagar Oct 24 '17 at 9:41

coalesce(1) works just fine. There is also a Hadoop Streaming option, which can merge HDFS files on the fly if you run this command:

 $ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
     -Dmapred.reduce.tasks=1 \
     -input "/hdfs/input/dir" \
     -output "/hdfs/output/dir" \
     -mapper cat \
     -reducer cat

Jeff A. May 15, '19 at 2:59 pm

0x0FFF · Accepted Answer · 2015-03-13T05:22:38+0000

Unfortunately, there is no other way to get a single output file from Spark. Instead of repartition(1) you can use coalesce(1), but with an argument of 1 the outcome is the same: Spark gathers all of your data into a single partition in memory, which can cause an OOM error if the data is too large.

Another way to merge files on HDFS is to write a simple MapReduce job (or a Pig job, or a Hadoop Streaming job) that takes the whole directory as input and produces one output file by using a single reducer. Keep in mind that with the MapReduce approach all of the data is first copied to the reducer's local file system, which can cause an "out of space" error.
