PySpark: convert a map column into multiple columns in a DataFrame

Input

I have a column Parameters of type map in the form:

    >>> from pyspark.sql import SQLContext
    >>> sqlContext = SQLContext(sc)
    >>> d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
    >>> df = sqlContext.createDataFrame(d)
    >>> df.collect()
    [Row(Parameters={'foo': '1', 'bar': '2', 'baz': 'aaa'})]

Output

I want to reshape it in PySpark so that all the keys (foo, bar, etc.) become columns, namely:

 [Row(foo='1', bar='2', baz='aaa')] 

Using withColumn works:

    (df
     .withColumn('foo', df.Parameters['foo'])
     .withColumn('bar', df.Parameters['bar'])
     .withColumn('baz', df.Parameters['baz'])
     .drop('Parameters')
    ).collect()

But I need a solution that does not explicitly mention the column names, since I have dozens of them.

Schema

    >>> df.printSchema()
    root
     |-- Parameters: map (nullable = true)
     |    |-- key: string
     |    |-- value: string (valueContainsNull = true)
1 answer

Since MapType keys are not part of the schema, you have to collect them first, for example:

    from pyspark.sql.functions import explode

    keys = (df
        .select(explode("Parameters"))
        .select("key")
        .distinct()
        .rdd.flatMap(lambda x: x)
        .collect())
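
For the sample data above, keys ends up holding the three distinct map keys, in no guaranteed order:

    keys
    # e.g. ['bar', 'baz', 'foo']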

Once you have the keys, all that is left is a simple select:

    from pyspark.sql.functions import col

    exprs = [col("Parameters").getItem(k).alias(k) for k in keys]
    df.select(*exprs)
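
Collecting the result of that select on the sample DataFrame yields one column per key; note that the column order follows the collected keys list, so it may differ from the order in the original dict:

    df.select(*exprs).collect()
    # e.g. [Row(bar='2', baz='aaa', foo='1')]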

Source: https://habr.com/ru/post/1247966/

