Create Spark DataFrame from Nested Dictionary

I have a list of nested dictionaries, for example ds = [{'a': {'b': {'c': 1}}}], and I want to create a Spark DataFrame from it while preserving the nested structure. Using sqlContext.createDataFrame(ds).printSchema() gives me the following schema:

    root
     |-- a: map (nullable = true)
     |    |-- key: string
     |    |-- value: map (valueContainsNull = true)
     |    |    |-- key: string
     |    |    |-- value: long (valueContainsNull = true)

but I need it to be:

    root
     |-- a: struct (nullable = true)
     |    |-- b: struct (nullable = true)
     |    |    |-- c: long (nullable = true)

The second schema can be produced by first converting the dictionaries to JSON and then loading them with jsonRDD, like this: sqlContext.jsonRDD(sc.parallelize([json.dumps(ds[0])])).printSchema(). But that would be rather cumbersome for large files.

I was thinking about converting the dictionaries to pyspark.sql.Row() objects, hoping that the DataFrame would infer a struct schema, but this did not work when the dictionaries had different schemas (for example, when some key was missing from the first record).
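One way to make the Row approach viable is to normalize the records first. Below is a pure-Python sketch of that idea (the helpers build_template and normalize are my own names, not part of any Spark API): compute the union of keys across all records, then fill the gaps with None so every record has an identical shape before schema inference.

```python
def build_template(records):
    """Union of keys across all records, recursing into nested dicts.

    Leaf values are replaced by None; only the structure matters."""
    template = {}
    for rec in records:
        for key, value in rec.items():
            if isinstance(value, dict):
                sub = template.get(key)
                merged = [sub, value] if isinstance(sub, dict) else [value]
                template[key] = build_template(merged)
            else:
                template.setdefault(key, None)
    return template

def normalize(record, template):
    """Return record with every key from template present;
    missing leaf values become None."""
    out = {}
    for key, tval in template.items():
        value = record.get(key)
        if isinstance(tval, dict):
            out[key] = normalize(value if isinstance(value, dict) else {}, tval)
        else:
            out[key] = value
    return out

ds = [{'a': {'b': {'c': 1}}}, {'a': {'b': {'d': 2}}}]
template = build_template(ds)
normalized = [normalize(rec, template) for rec in ds]
# normalized[1] is {'a': {'b': {'c': None, 'd': 2}}}
```

After this step every record shares one structure, so converting each one to nested Row objects (or to JSON) should yield a consistent struct schema.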

Is there any other way to do this? Thanks!

1 answer

I think this will help.

    import json

    ds = [{'a': {'b': {'c': 1}}}]
    ds2 = [json.dumps(item) for item in ds]

    df = sqlCtx.jsonRDD(sc.parallelize(ds2))
    df.printSchema()

which prints:

    root
     |-- a: struct (nullable = true)
     |    |-- b: struct (nullable = true)
     |    |    |-- c: long (nullable = true)
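To see why this works without Spark at hand, here is a minimal standalone sketch of what jsonRDD consumes: one JSON document per line, the same "JSON Lines" form that newer APIs such as spark.read.json read. The round trip preserves the nesting exactly, which is why the inferred type comes back as a struct rather than a map.

```python
import json

ds = [{'a': {'b': {'c': 1}}}, {'a': {'b': {'c': 2}}}]

# Serialize each record to one JSON document per line ("JSON Lines").
# The JSON reader infers a struct type from this representation,
# so the nested dictionary structure survives intact.
lines = [json.dumps(rec) for rec in ds]

# Round-tripping shows nothing is lost in the conversion.
assert json.loads(lines[0]) == ds[0]
```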

Source: https://habr.com/ru/post/985596/

