I am looking for a way to export data from Apache Spark to various other tools in JSON format. I suppose there should be a really easy way to do this.
Example: I have the following JSON file jfile.json:
{"key":value_a1, "key2":value_b1}, {"key":value_a2, "key2":value_b2}, {...}
where each line of the file is a JSON object. These files can easily be read into PySpark with
jsonRDD = sqlContext.jsonFile('jfile.json')
and then look like this (when calling jsonRDD.collect()):
[Row(key=value_a1, key2=value_b1),Row(key=value_a2, key2=value_b2)]
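For completeness, the full read step I use looks roughly like this (a minimal sketch, assuming a Spark 1.x PySpark shell where the SparkContext sc is already defined):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is the SparkContext provided by the shell

# jsonFile() infers a schema from the file and returns an RDD of Row objects
jsonRDD = sqlContext.jsonFile('jfile.json')
print(jsonRDD.collect())
# [Row(key=value_a1, key2=value_b1), Row(key=value_a2, key2=value_b2)]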
Now I want to save this data back out as a clean JSON file.
I found this entry in the Spark User list:
http://apache-spark-user-list.1001560.n3.nabble.com/Updating-exising-JSON-files-td12211.html
which suggested using
jsonRDD.saveAsTextFile(path)
After that, the text file looks like
Row(key=value_a1, key2=value_b1)
Row(key=value_a2, key2=value_b2)
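Concretely, what I ran was along these lines (a sketch; the output directory name jfile_out is just an example):

# saveAsTextFile() calls str() on every element, which is why the output
# contains the Row(...) representation instead of JSON; note that it also
# writes a directory of part-* files, not a single file
jsonRDD.saveAsTextFile('jfile_out')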
i.e., the Row objects of jsonRDD have simply been written to the file verbatim. After reading the Spark user list entry, I had expected some kind of "automatic" conversion back to JSON. My goal is to end up with a file that has the same layout as the jfile.json shown at the beginning.
Am I missing a really obvious easy way to do this?
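For what it's worth, the closest I have come is converting each Row back into a dict by hand and serializing it myself (a sketch, assuming a PySpark version where Row has an asDict() method; the output directory name is again just an example):

import json

# Turn each Row back into a plain dict, dump it as a JSON string,
# and write one JSON object per line -- the same layout as jfile.json
jsonRDD.map(lambda row: json.dumps(row.asDict())).saveAsTextFile('jfile_json_out')

This produces the layout I want, but it feels like a manual workaround rather than the obvious built-in way.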
I have read http://spark.apache.org/docs/latest/programming-guide.html and searched Google, the user list, and Stack Overflow for answers, but almost all of them deal with reading and parsing JSON into Spark. I even bought the book Learning Spark, but the examples there (p. 71) simply lead to the same output file as above.
Can anyone help me out? I feel like I am only missing a small piece here.
Greetings and thanks in advance!