With data frame as follows:
from pyspark.sql.functions import avg, first rdd = sc.parallelize( [ (0, "A", 223,"201603", "PORT"), (0, "A", 22,"201602", "PORT"), (0, "A", 422,"201601", "DOCK"), (1,"B", 3213,"201602", "DOCK"), (1,"B", 3213,"201601", "PORT"), (2,"C", 2321,"201601", "DOCK") ] ) df_data = sqlContext.createDataFrame(rdd, ["id","type", "cost", "date", "ship"]) df_data.show()
I use it
df_data.groupby(df_data.id, df_data.type).pivot("date").agg(avg("cost"), first("ship")).show() +---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+ | id|type|201601_avg(cost)|201601_first(ship)()|201602_avg(cost)|201602_first(ship)()|201603_avg(cost)|201603_first(ship)()| +---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+ | 2| C| 2321.0| DOCK| null| null| null| null| | 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT| | 1| B| 3213.0| PORT| 3213.0| DOCK| null| null| +---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
But I get these very complex column names. Using alias in aggregation usually works, but because of pivot , the names are even worse in this case:
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+ | id|type|201601_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201601_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201602_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201602_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201603_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201603_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620| +---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+ | 2| C| 2321.0| DOCK| null| null| null| null| | 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT| | 1| B| 3213.0| PORT| 3213.0| DOCK| null| null| +---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
Is there a way to rename column names on the fly based on and aggregation?