I am trying to use column aliases after pivoting on a value in a PySpark DataFrame. The problem is that the column names I pass to the alias calls are not applied correctly.
Specific example:
Starting from this data frame:
    import pyspark.sql.functions as func

    df = sc.parallelize([
        (217498, 100000001, 'A'),
        (217498, 100000025, 'A'),
        (217498, 100000124, 'A'),
        (217498, 100000152, 'B'),
        (217498, 100000165, 'C'),
        (217498, 100000177, 'C'),
        (217498, 100000182, 'A'),
        (217498, 100000197, 'B'),
        (217498, 100000210, 'B'),
        (854123, 100000005, 'A'),
        (854123, 100000007, 'A')
    ]).toDF(["user_id", "timestamp", "actions"])
which gives
    +-------+---------+-------+
    |user_id|timestamp|actions|
    +-------+---------+-------+
    | 217498|100000001|      A|
    | 217498|100000025|      A|
    | 217498|100000124|      A|
    | 217498|100000152|      B|
    | 217498|100000165|      C|
    | 217498|100000177|      C|
    | 217498|100000182|      A|
    | 217498|100000197|      B|
    | 217498|100000210|      B|
    | 854123|100000005|      A|
    | 854123|100000007|      A|
    +-------+---------+-------+
The problem is that the call
    df = df.groupby('user_id')\
        .pivot('actions')\
        .agg(func.count('timestamp').alias('ts_count'),
             func.mean('timestamp').alias('ts_mean'))
gives column names
    df.columns

    ['user_id',
     'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
     'A_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
     'B_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
     'B_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
     'C_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
     'C_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5']
which are completely impractical.
I could clean up the column names using the methods shown here (regex) or here (using withColumnRenamed()). However, these are workarounds that can easily break after an update.
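To make the regex workaround concrete, here is a minimal sketch of what I mean. The helper name `clean_pivot_columns` and the pattern `PIVOT_NAME` are my own, hypothetical names, and the pattern assumes the generated names look exactly like the ones listed above, which is precisely why this feels fragile:

```python
import re

# Assumed shape of a generated name (taken from the df.columns output above):
#   <pivot value>_(<aggregate expression>) AS <alias>#<expression id>[L]
PIVOT_NAME = re.compile(r'^(?P<value>.+?)_\(.*\) AS (?P<alias>\w+?)#\d+L?$')


def clean_pivot_columns(columns):
    """Return the column names with generated names shortened to '<value>_<alias>'.

    Names that do not match the assumed pattern (e.g. 'user_id') are kept as-is.
    """
    cleaned = []
    for name in columns:
        match = PIVOT_NAME.match(name)
        if match:
            cleaned.append('{}_{}'.format(match.group('value'), match.group('alias')))
        else:
            cleaned.append(name)
    return cleaned


generated = [
    'user_id',
    'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
    'A_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
]
print(clean_pivot_columns(generated))
```

With a real data frame this would be applied as `df = df.toDF(*clean_pivot_columns(df.columns))`, but it only works as long as the generated name format does not change.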
To summarize: how can I use the columns generated by the pivot without having to parse them (for example, the generated name 'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L')?
Any help would be appreciated! Thanks