I have a list of nested dictionaries, for example ds = [{'a': {'b': {'c': 1}}}], and I want to create a Spark DataFrame from it while preserving the nested structure. Using sqlContext.createDataFrame(ds).printSchema() gives me the following schema:
root
 |-- a: map (nullable = true)
 |    |-- key: string
 |    |-- value: map (valueContainsNull = true)
 |    |    |-- key: string
 |    |    |-- value: long (valueContainsNull = true)
but I need it to be:
root
 |-- a: struct (nullable = true)
 |    |-- b: struct (nullable = true)
 |    |    |-- c: long (nullable = true)
The second schema can be created by first converting the dictionaries to JSON and then loading them with jsonRDD, like this: sqlContext.jsonRDD(sc.parallelize([json.dumps(ds[0])])).printSchema(). But that would be rather cumbersome for large files.
I was thinking about converting the dictionaries to pyspark.sql.Row() objects, hoping that the DataFrame would infer the schema, but that did not work when the dictionaries had different schemas (for example, when some key was missing from the first dictionary).
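For context, the schema-inference problem above can be worked around by normalizing the dictionaries before building Rows, so that every dictionary has the same nested key structure. Here is a minimal pure-Python sketch of that idea (no Spark required to run it); merge_keys and fill_missing are hypothetical helper names, not part of any library:

```python
def merge_keys(dicts):
    """Recursively collect the union of keys at every nesting level."""
    merged = {}
    for d in dicts:
        for k, v in d.items():
            if isinstance(v, dict):
                sub = merged.get(k)
                # Merge this nested dict with whatever was seen so far.
                merged[k] = merge_keys([v] + ([sub] if isinstance(sub, dict) else []))
            else:
                merged.setdefault(k, None)
    return merged

def fill_missing(d, schema):
    """Return a copy of d with every key from schema present (missing leaves become None)."""
    out = {}
    for k, sub in schema.items():
        v = d.get(k)
        if isinstance(sub, dict):
            out[k] = fill_missing(v if isinstance(v, dict) else {}, sub)
        else:
            out[k] = v
    return out

# Dictionaries with different shapes: the second one is missing the 'c' key.
ds = [{'a': {'b': {'c': 1}}}, {'a': {'b': {}}}]
schema = merge_keys(ds)
normalized = [fill_missing(d, schema) for d in ds]
```

After this normalization, every dictionary has an identical shape, so converting each level to a pyspark.sql.Row (or letting createDataFrame infer the schema from the first element) should no longer break on missing keys.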
Is there any other way to do this? Thanks!