PySpark: save SchemaRDD as JSON file

I am looking for a way to export data from Apache Spark to various other tools in JSON format. I suppose there should be a really easy way to do this.

Example: I have the following JSON file, jfile.json:

{"key":value_a1, "key2":value_b1}, {"key":value_a2, "key2":value_b2}, {...} 

where each line of the file is a JSON object. These files can be easily read in PySpark using

 jsonRDD = sqlContext.jsonFile('jfile.json') 

which then looks like this (when calling jsonRDD.collect()):

 [Row(key=value_a1, key2=value_b1),Row(key=value_a2, key2=value_b2)] 

Now I want to save this data back out as a clean JSON file.

I found this entry in the Spark User list:

http://apache-spark-user-list.1001560.n3.nabble.com/Updating-exising-JSON-files-td12211.html

which suggested using

 jsonRDD.saveAsTextFile(path) 

After that, the text file looks like

 Row(key=value_a1, key2=value_b1)
 Row(key=value_a2, key2=value_b2)

i.e., the jsonRDD has simply been written out verbatim. After reading the Spark user list entry, I would have expected some kind of "automatic" conversion back to JSON. My goal is to end up with a file similar to the "jfile.json" mentioned at the beginning.

Am I missing a really obvious easy way to do this?

I read http://spark.apache.org/docs/latest/programming-guide.html , searched Google, the user list, and Stack Overflow for answers, but almost all of them deal with reading and parsing JSON into Spark. I even bought the book Learning Spark, but the examples there (p. 71) simply lead to the same output file as above.

Can anyone help me out? I feel like I am only missing a small link here.

Greetings and thanks in advance!

3 answers

I do not see an easy way to do this. One solution is to convert each SchemaRDD element to a String, ending up with an RDD[String] where each element is the formatted JSON for that row. So you need to write your own JSON serializer. That is the easy part. It may not be very fast, but it should work in parallel, and you already know how to save an RDD to a text file.

The key observation is that you can get the schema out of a SchemaRDD by calling its schema method. Then each Row handed to your map function needs to be traversed recursively, in tandem with the schema. For flat JSON this is a straightforward tandem traversal, but you may also need to handle nested JSON.

The rest is just a matter of Python, which I won't go into, but I do have this working in Scala, and that should help you. The parts where the Scala code gets dense do not actually depend on deep Spark knowledge, so if you understand basic recursion and know Python, you should be able to get it working. The bulk of the work for you is figuring out how to work with pyspark.sql.Row and pyspark.sql.StructType in the Python API.

One word of caution: I am pretty sure my code does not yet work for missing values; the formatItem method needs to handle null elements.
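For what it's worth, here is a minimal, untested Python sketch of that tandem Row/schema traversal. It assumes the Spark 1.x SchemaRDD API (where schema() is a method and Rows iterate like tuples); format_row is a hypothetical helper name, it only covers flat fields and nested structs, and the null handling is the bare minimum the caution above calls for:

 import json
 from pyspark.sql import StructType  # pyspark.sql.types in later versions

 # Hypothetical helper: walk a Row in tandem with its schema and build
 # a plain dict that json.dumps can serialize.
 def format_row(row, schema):
     out = {}
     for field, value in zip(schema.fields, row):
         if value is None:
             out[field.name] = None  # bare-minimum null handling
         elif isinstance(field.dataType, StructType):
             out[field.name] = format_row(value, field.dataType)  # nested JSON
         else:
             out[field.name] = value  # flat field: str/int/float/bool
     return out

 schema = jsonRDD.schema()  # a method on SchemaRDD in Spark 1.x
 jsonRDD.map(lambda row: json.dumps(format_row(row, schema))) \
        .saveAsTextFile('jfile_out')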

Edit: In Spark 1.2.0, the toJSON method was introduced on SchemaRDD, which makes this a much simpler problem; see the answer by @jegordon.


You can use the toJSON() method, which converts a SchemaRDD into a MappedRDD of JSON documents.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=tojson#pyspark.sql.SchemaRDD.toJSON
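For example, assuming Spark 1.2+ and an existing SQLContext (the output path here is just a placeholder), the whole round trip becomes:

 jsonRDD = sqlContext.jsonFile('jfile.json')   # SchemaRDD of Rows
 jsonRDD.toJSON().saveAsTextFile('jfile_out')  # one JSON document per line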


I used org.apache.spark.sql.json in Spark SQL directly from the SQL console. It is not the most efficient way, and it may be considered a hack, but it gets the job done.

 CREATE TABLE jsonTable (
     key STRING,
     value STRING
 )
 USING org.apache.spark.sql.json
 OPTIONS (PATH "destination/path");

After creating the table, insert the data from a registered temp table, or from any other table:

 INSERT OVERWRITE TABLE jsonTable SELECT * FROM tempTable; 

NOTE: This appears to launch a Hive map-reduce job, creating several part files at the provided path. Expect slow execution.

NOTE: The path specified at table creation is in HDFS, not on the local file system.

NOTE: I have not tried invoking this from a script via SQLContext.sql, but it should be possible.
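If you do try that, a hypothetical, untested sketch could look like this (it assumes sqlContext is a HiveContext, since the persistent CREATE TABLE ... USING syntax generally needs Hive support, and the table names and path are placeholders):

 # Register the SchemaRDD so SQL can see it (Spark 1.1+).
 jsonRDD.registerTempTable('tempTable')

 sqlContext.sql("""
     CREATE TABLE jsonTable (key STRING, value STRING)
     USING org.apache.spark.sql.json
     OPTIONS (PATH "destination/path")
 """)
 sqlContext.sql('INSERT OVERWRITE TABLE jsonTable SELECT * FROM tempTable')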

NOTE: Selecting from jsonTable afterwards may fail due to serialization issues.

