Create Spark DataFrame from Nested Dictionary

I have a list of nested dictionaries, for example ds = [{'a': {'b': {'c': 1}}}], and I want to create a Spark DataFrame from it while preserving the nested structure. Using sqlContext.createDataFrame(ds).printSchema() gives me the following schema:

    root
     |-- a: map (nullable = true)
     |    |-- key: string
     |    |-- value: map (valueContainsNull = true)
     |    |    |-- key: string
     |    |    |-- value: long (valueContainsNull = true)

but I need it to be:

    root
     |-- a: struct (nullable = true)
     |    |-- b: struct (nullable = true)
     |    |    |-- c: long (nullable = true)

The second schema can be produced by first converting the dictionaries to JSON and then loading them with jsonRDD, like this: sqlContext.jsonRDD(sc.parallelize([json.dumps(ds[0])])).printSchema(). But that would be rather cumbersome for large files.

I was thinking about converting the dictionaries to pyspark.sql.Row() objects, hoping that the DataFrame would infer a struct schema, but this did not work when the dictionaries had different schemas (for example, when some key was missing from the first record).
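One way to make the Row approach viable is to normalize the records first. Below is a pure-Python sketch of that idea (the helpers build_template and normalize are my own names, not part of any Spark API): compute the union of keys across all records, then fill the gaps with None so every record has an identical shape before schema inference.

```python
def build_template(records):
    """Union of keys across all records, recursing into nested dicts.

    Leaf values are replaced by None; only the structure matters."""
    template = {}
    for rec in records:
        for key, value in rec.items():
            if isinstance(value, dict):
                sub = template.get(key)
                merged = [sub, value] if isinstance(sub, dict) else [value]
                template[key] = build_template(merged)
            else:
                template.setdefault(key, None)
    return template

def normalize(record, template):
    """Return record with every key from template present;
    missing leaf values become None."""
    out = {}
    for key, tval in template.items():
        value = record.get(key)
        if isinstance(tval, dict):
            out[key] = normalize(value if isinstance(value, dict) else {}, tval)
        else:
            out[key] = value
    return out

ds = [{'a': {'b': {'c': 1}}}, {'a': {'b': {'d': 2}}}]
template = build_template(ds)
normalized = [normalize(rec, template) for rec in ds]
# normalized[1] is {'a': {'b': {'c': None, 'd': 2}}}
```

After this step every record shares one structure, so converting each one to nested Row objects (or to JSON) should yield a consistent struct schema.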

Is there any other way to do this? Thanks!

1 answer

I think this will help.

    import json

    ds = [{'a': {'b': {'c': 1}}}]
    ds2 = [json.dumps(item) for item in ds]

    df = sqlCtx.jsonRDD(sc.parallelize(ds2))
    df.printSchema()

which prints:

    root
     |-- a: struct (nullable = true)
     |    |-- b: struct (nullable = true)
     |    |    |-- c: long (nullable = true)
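To see why this works without Spark at hand, here is a minimal standalone sketch of what jsonRDD consumes: one JSON document per line, the same "JSON Lines" form that newer APIs such as spark.read.json read. The round trip preserves the nesting exactly, which is why the inferred type comes back as a struct rather than a map.

```python
import json

ds = [{'a': {'b': {'c': 1}}}, {'a': {'b': {'c': 2}}}]

# Serialize each record to one JSON document per line ("JSON Lines").
# The JSON reader infers a struct type from this representation,
# so the nested dictionary structure survives intact.
lines = [json.dumps(rec) for rec in ds]

# Round-tripping shows nothing is lost in the conversion.
assert json.loads(lines[0]) == ds[0]
```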

Source: https://habr.com/ru/post/985596/

