PySpark: convert a map column into multiple columns in a DataFrame

Input

I have a column Parameters of type map in the form:

    >>> from pyspark.sql import SQLContext
    >>> sqlContext = SQLContext(sc)
    >>> d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
    >>> df = sqlContext.createDataFrame(d)
    >>> df.collect()
    [Row(Parameters={'foo': '1', 'bar': '2', 'baz': 'aaa'})]

Output

I want to reshape it in PySpark so that all the keys (foo, bar, etc.) become columns, namely:

 [Row(foo='1', bar='2', baz='aaa')] 

Using withColumn works:

    (df
     .withColumn('foo', df.Parameters['foo'])
     .withColumn('bar', df.Parameters['bar'])
     .withColumn('baz', df.Parameters['baz'])
     .drop('Parameters')
    ).collect()

But I need a solution that does not explicitly mention the column names, since I have dozens of them.

Schema

    >>> df.printSchema()
    root
     |-- Parameters: map (nullable = true)
     |    |-- key: string
     |    |-- value: string (valueContainsNull = true)
1 answer

Since MapType keys are not part of the schema, you have to collect them first, for example:

    from pyspark.sql.functions import explode

    keys = (df
        .select(explode("Parameters"))
        .select("key")
        .distinct()
        .rdd.flatMap(lambda x: x)
        .collect())
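
For the sample data above, keys ends up holding the three distinct map keys, in no guaranteed order:

    keys
    # e.g. ['bar', 'baz', 'foo']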

Once you have the keys, all that is left is a simple select:

    from pyspark.sql.functions import col

    exprs = [col("Parameters").getItem(k).alias(k) for k in keys]
    df.select(*exprs)
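
Collecting the result of that select on the sample DataFrame yields one column per key; note that the column order follows the collected keys list, so it may differ from the order in the original dict:

    df.select(*exprs).collect()
    # e.g. [Row(bar='2', baz='aaa', foo='1')]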

Source: https://habr.com/ru/post/1247966/

