Reducing the number of reduce tasks will slow the process down. However, it should still be significantly faster than Hadoop MapReduce for your use case.
In my opinion, the best way to limit the number of output files is to use the coalesce(numPartitions) transformation. Here is an example:
JavaSparkContext ctx = new JavaSparkContext();
JavaRDD<String> myData = ctx.textFile("path/to/my/file.txt");
// coalesce down to 20 partitions so that only 20 output files are written
myData.coalesce(20).saveAsTextFile("path/to/output");
In this example, the map function would still be executed by 1020 tasks; that is not changed in any way. However, after coalescing there are only 20 partitions to work with, so 20 output files will be written at the end of the program. As mentioned above, keep in mind that this approach will be slower than writing 1020 output files, because the data has to be merged down from 1020 partitions to 20.
Note: also take a look at the repartition command.
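As a rough sketch of the difference (continuing the example above; the output path is a placeholder): repartition(n) always performs a full shuffle and, unlike coalesce(n), can increase the partition count as well as decrease it. It is equivalent to calling coalesce(n, shuffle = true).

// repartition(20) shuffles all data into exactly 20 partitions, keeping
// the upstream stages fully parallel at the cost of a full shuffle
JavaRDD<String> reshuffled = myData.repartition(20);
reshuffled.saveAsTextFile("path/to/other/output");

Whether this pays off depends on the job: if the stages before the write are expensive, the extra shuffle can be cheaper than running them with reduced parallelism.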