PySpark 1.6 - Aliasing columns after pivoting with multiple aggregates

Question

I am trying to set aliases on the columns I get after pivoting on a value in a PySpark DataFrame. The problem is that the column names I pass to alias() are not applied correctly.

Specific example:

Starting from this data frame:

    import pyspark.sql.functions as func

    # sc is the SparkContext provided by the PySpark shell
    df = sc.parallelize([
        (217498, 100000001, 'A'),
        (217498, 100000025, 'A'),
        (217498, 100000124, 'A'),
        (217498, 100000152, 'B'),
        (217498, 100000165, 'C'),
        (217498, 100000177, 'C'),
        (217498, 100000182, 'A'),
        (217498, 100000197, 'B'),
        (217498, 100000210, 'B'),
        (854123, 100000005, 'A'),
        (854123, 100000007, 'A')
    ]).toDF(["user_id", "timestamp", "actions"])

which gives

    +-------+---------+-------+
    |user_id|timestamp|actions|
    +-------+---------+-------+
    | 217498|100000001|    'A'|
    | 217498|100000025|    'A'|
    | 217498|100000124|    'A'|
    | 217498|100000152|    'B'|
    | 217498|100000165|    'C'|
    | 217498|100000177|    'C'|
    | 217498|100000182|    'A'|
    | 217498|100000197|    'B'|
    | 217498|100000210|    'B'|
    | 854123|100000005|    'A'|
    | 854123|100000007|    'A'|
    +-------+---------+-------+

The problem is that the call

    df = df.groupby('user_id')\
        .pivot('actions')\
        .agg(func.count('timestamp').alias('ts_count'),
             func.mean('timestamp').alias('ts_mean'))

gives column names

    df.columns
    ['user_id',
     'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
     'A_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
     'B_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
     'B_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
     'C_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
     'C_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5']

which are completely impractical.

I could clean up the column names using the methods shown here (regex) or here (using withColumnRenamed()). However, these are workarounds that can easily break after an update.

To summarize: how can I use the columns generated by the pivot without having to parse them (for example, the generated name 'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L')?
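To make the goal concrete, here is a minimal sketch of one way to avoid parsing: build the desired names from values I already know and assign them by position. It assumes the pivot emits columns in the order seen in the df.columns output above (grouping columns first, then each pivot value in turn, with the aggregates in the order given to agg). The helper pivot_column_names is hypothetical, not part of PySpark.

    ```python
    from itertools import product


    def pivot_column_names(group_cols, pivot_values, aliases):
        """Build flat names such as 'A_ts_count' without parsing generated names.

        Assumption: the pivoted frame lists the grouping columns first, then,
        for each pivot value (outer loop), one column per aggregate (inner loop).
        """
        pivoted = ["{}_{}".format(value, alias)
                   for value, alias in product(pivot_values, aliases)]
        return list(group_cols) + pivoted


    names = pivot_column_names(["user_id"], ["A", "B", "C"], ["ts_count", "ts_mean"])
    print(names)

    # Hypothetical usage against the pivoted frame from the question
    # (left commented out because it needs a running SparkContext):
    # pivoted = df.groupby('user_id').pivot('actions', ['A', 'B', 'C']).agg(...)
    # assert len(names) == len(pivoted.columns)
    # pivoted = pivoted.toDF(*names)
    ```

Passing the pivot values explicitly to pivot() should fix their order, so the positional rename via toDF() does not depend on whatever values happen to be in the data. It still relies on the column ordering assumption stated above, so checking the column count before renaming seems prudent.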

Any help would be appreciated! Thanks

python-2.7 pivot apache-spark pyspark pyspark-sql

hyperc54 Jan 24 '17 at 16:02

No one has answered this question yet.

See similar questions:

Renaming columns for PySpark DataFrame aggregates

Renaming a pivoted and aggregated column in a PySpark DataFrame
