You can execute the count function over a window and then use built-in functions to get the final dataframe you want, by doing the following:
from pyspark.sql import Window
from pyspark.sql import functions as F

windowSpec = Window.partitionBy("c1")

result = (df.withColumn("cnt_orig", F.count("c1").over(windowSpec))
            .orderBy("c3")
            .groupBy("c1", "c2", "cnt_orig")
            .agg(F.first("c3").alias("c3"))
            # stringify the [c2, c3] pair, strip the brackets, and turn the comma into " : "
            .withColumn("c2", F.regexp_replace(
                F.regexp_replace(F.array("c2", "c3").cast("string"), "[\\[\\]]", ""),
                ",", " : "))
            .groupBy("c1", "cnt_orig")
            .agg(F.collect_list("c2").alias("map_category_room_date")))
You should get the following result
+---+--------+----------------------+
|c1 |cnt_orig|map_category_room_date|
+---+--------+----------------------+
|A  |4       |[b : 09:00, c : 22:00]|
|b  |1       |[c : 09:00]           |
+---+--------+----------------------+
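For reference, here is a minimal sketch of an input that would produce a table like the one above. The rows are hypothetical sample data (assuming c3 holds time strings and an active SparkSession named spark):

# hypothetical sample data, for illustration only
df = spark.createDataFrame(
    [("A", "b", "09:00"), ("A", "b", "10:00"),
     ("A", "c", "22:00"), ("A", "c", "23:00"),
     ("b", "c", "09:00")],
    ["c1", "c2", "c3"])

# result is the DataFrame built by the chain above
result.show(truncate=False)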
Scala way
Working code to get the desired result in Scala:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import spark.implicits._

val windowSpec = Window.partitionBy("c1")

df.withColumn("cnt_orig", count("c1").over(windowSpec))
  .orderBy("c3")
  .groupBy("c1", "c2", "cnt_orig")
  .agg(first("c3").as("c3"))
  // stringify the [c2, c3] pair, strip the brackets, and turn the comma into " : "
  .withColumn("c2", regexp_replace(regexp_replace(array($"c2", $"c3").cast(StringType), "[\\[\\]]", ""), ",", " : "))
  .groupBy("c1", "cnt_orig")
  .agg(collect_list("c2").as("map_category_room_date"))
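Note that cnt_orig is computed with a window function rather than inside the later aggregations: the window count captures the number of rows per c1 in the original dataframe, whereas a count taken after the groupBy steps would only see the rows that remain once duplicate (c1, c2) combinations have been collapsed.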