saveAsTextFile method in Spark

In my project I have three input files, whose names I pass as args(0) through args(2); the name of the output file is args(3). In the source code I use

    val sc = new SparkContext()
    var log = sc.textFile(args(0))
    for (i <- 1 until args.size - 1)
      log = log.union(sc.textFile(args(i)))

I do nothing else with log; I just save it as a text file with

    log.coalesce(1, true).saveAsTextFile(args(args.size - 1))

but it is still saved as 3 files: part-00000, part-00001, part-00002. How can I merge the three input files into a single output file?

2 answers

Having multiple output files is the standard behavior of multi-machine clusters such as Hadoop or Spark: the number of output files corresponds to the number of reducers (in Spark, the number of partitions of the RDD being saved).
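To make this concrete, here is a minimal sketch (the variable name merged is my own, not from the question) that unions the inputs, checks the partition count, and forces a single partition before saving:

    // Union the three inputs; the result usually has one partition per input split.
    val merged = sc.textFile(args(0))
      .union(sc.textFile(args(1)))
      .union(sc.textFile(args(2)))
    // Number of part-files that saveAsTextFile would write:
    println(merged.partitions.size)
    // repartition(1) is equivalent to coalesce(1, shuffle = true)
    merged.repartition(1).saveAsTextFile(args(3))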

How to "solve" it in Hadoop: merge output files after phase reduction

How to "solve" in Spark: how to make saveAsTextFile NOT split output into multiple files?

You can get good information here: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html
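A common approach from those threads is to let Spark write the multiple part-files in parallel and merge them afterwards through the Hadoop FileSystem API. A hedged sketch (FileUtil.copyMerge exists in the Hadoop 2.x API but was removed in Hadoop 3.0; the paths "out" and "out-merged.txt" are assumptions):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    // Merge every part-file under "out" into the single file "out-merged.txt".
    FileUtil.copyMerge(fs, new Path("out"), fs, new Path("out-merged.txt"),
      false /* deleteSource: keep the part-files */, conf, null)

This keeps the Spark job itself fully parallel and pays the single-file cost only once, at the end.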

So you were right about coalesce(1, true). However, it is very inefficient. Interestingly (as @climbage mentioned in his comment), your code works if you run it locally.

What you can try is to read the files first and then save the result.

    ...
    val sc = new SparkContext()
    // Collect every input file's lines on the driver (only feasible for small files!)
    val str = new StringBuilder()
    for (i <- 0 until args.size - 1) {
      val file = sc.textFile(args(i))
      file.collect().foreach(line => str ++= line + "\n")
    }
    // and now you can save the content as a single file
    sc.parallelize(Seq(str.toString), 1).saveAsTextFile("out")

Note: this code is also extremely inefficient and works only for small files! You need to come up with better code. I would not try to reduce the number of files; instead, I would process the multiple output files, as the sketch below shows.
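If you keep the multiple output files, note that Spark reads a whole output directory back transparently, so downstream jobs rarely need a single file at all. A small sketch (the path "out" is an assumption):

    // A directory of part-files behaves like one dataset when read back:
    val reread = sc.textFile("out")             // reads every part-* file inside
    val rereadGlob = sc.textFile("out/part-*")  // or select them explicitly with a glob
    println(reread.count())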


As already mentioned, your problem is somewhat inevitable with the standard API, since it assumes you are dealing with large quantities of data. However, if your data is manageable, you can try the following:

    import java.nio.file.{Paths, Files}
    import java.nio.charset.StandardCharsets

    Files.write(Paths.get("./test_file"),
      data.collect.mkString("\n").getBytes(StandardCharsets.UTF_8))

What I'm doing here is converting the RDD to a String by executing collect and then mkString. I would suggest not doing this in production; it works fine for local data analysis (I have worked with up to 5 GB of local data this way).
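If the collected data is too big to buffer in one driver-side String, a gentler variant of the same idea (still driver-bound; the file name is taken from the example above) streams one partition at a time with RDD.toLocalIterator and writes through a buffered writer:

    import java.io.{BufferedWriter, FileWriter}

    // Pulls partitions to the driver one at a time instead of all at once.
    val writer = new BufferedWriter(new FileWriter("./test_file"))
    try {
      data.toLocalIterator.foreach { line =>
        writer.write(line)
        writer.newLine()
      }
    } finally {
      writer.close()
    }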

