Spark: Reduce the number of output files

I wrote a Spark program that mimics the functionality of an existing MapReduce job. The MR job takes about 50 minutes every day, but the Spark job finished in just 9 minutes! Wonderful!

When I looked at the output directory, I noticed that it created 1020 part files. The MR job uses only 20 reducers, so it creates only 20 files. We need to reduce the number of output files; otherwise our namespace would fill up in no time.

I am trying to figure out how to reduce the number of output files in Spark. It seems 1020 tasks are being triggered, and each one creates a part file. Is that right? Do I have to change the level of parallelism to cut the number of tasks, and thereby the number of output files? If so, how do I set it? I'm afraid fewer tasks will slow the process down, but I can test that!

1 answer

Reducing the number of tasks will indeed slow the process down. However, it should still be significantly faster than Hadoop MapReduce for your use case.

In my opinion, the best way to limit the number of output files is the coalesce(numPartitions) transformation. The following is an example:

 JavaSparkContext ctx = new JavaSparkContext(/* your configuration */);
 JavaRDD<String> myData = ctx.textFile("path/to/my/file.txt");

 // Assume the input produces 1020 partitions and thus 1020 map tasks
 JavaRDD<String> mappedData = myData.map( your map function );

 // We need only 20 output files, so merge down to 20 partitions
 JavaRDD<String> newData = mappedData.coalesce(20);
 newData.saveAsTextFile("output path");

In this example, the map function will still be executed by 1020 tasks; that is not changed in any way. However, after coalescing there are only 20 partitions to work with, so only 20 output files are written at the end of the program.
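
If you want to confirm the partition counts at each stage, you can print them before and after the coalesce. This is a small sketch continuing the example above; it assumes Spark 1.6+ where getNumPartitions() is available on JavaRDD (on older versions, partitions().size() gives the same number).

 // Hypothetical check, reusing mappedData and newData from the example above
 System.out.println("Partitions before coalesce: " + mappedData.getNumPartitions()); // e.g. 1020
 System.out.println("Partitions after coalesce:  " + newData.getNumPartitions());    // 20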

As mentioned earlier, note that this method will be slower than simply writing 1020 output files, because the data has to be moved into fewer partitions (from 1020 down to 20).

Note: also have a look at the repartition transformation at the following link.
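
For completeness, a sketch of the repartition alternative, under the assumption that an extra shuffle is acceptable: repartition(n) is equivalent to coalesce(n, shuffle = true), so it performs a full shuffle but lets the map stage keep its original parallelism, which can help with the slowdown mentioned above.

 // repartition(20) always shuffles; it is equivalent to coalesce(20, true)
 JavaRDD<String> repartitioned = mappedData.repartition(20);
 repartitioned.saveAsTextFile("output path");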


Source: https://habr.com/ru/post/975630/
