Calculating a quantile on grouped data in a Spark DataFrame

I have the following DataFrame:

+--------+--------------+
|agent_id|payment_amount|
+--------+--------------+
|       a|          1000|
|       b|          1100|
|       a|          1100|
|       a|          1200|
|       b|          1200|
|       b|          1250|
|       a|         10000|
|       b|          9000|
+--------+--------------+

My desired output would be something like:

agent_id   95_quantile
  a          whatever the 95th percentile is for agent a's payments
  b          whatever the 95th percentile is for agent b's payments

For each agent_id group I need to calculate the 0.95 quantile. I tried the following approach:

test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)

but I get the following error:

'GroupedData' object has no attribute 'approxQuantile'

I need the .95 quantile (percentile) in a new column, so that it can later be used for filtering.
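To make the expected numbers concrete, here is a minimal plain-Python sketch (not Spark) of what a per-group 0.95 quantile computation does: group the rows by key, then take the quantile of each group with linear interpolation between the two nearest order statistics. The function name and the interpolation method are illustrative assumptions; Spark's `percentile_approx` uses its own approximation and may return slightly different values.

```python
from collections import defaultdict

def group_quantile(rows, q):
    """Group (key, value) pairs by key and return the q-quantile per key,
    using linear interpolation between the two nearest sorted values."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    result = {}
    for key, values in groups.items():
        values.sort()
        h = q * (len(values) - 1)            # fractional index into the sorted list
        lo = int(h)
        hi = min(lo + 1, len(values) - 1)
        result[key] = values[lo] + (h - lo) * (values[hi] - values[lo])
    return result

payments = [("a", 1000), ("b", 1100), ("a", 1100), ("a", 1200),
            ("b", 1200), ("b", 1250), ("a", 10000), ("b", 9000)]
quantiles = group_quantile(payments, 0.95)
```

For the data above this gives roughly 8680.0 for agent a and 7837.5 for agent b; the single large payment in each group dominates the upper tail, which is exactly what the filtering use case cares about.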

1 answer

One solution would be to use HiveContext and percentile_approx:

>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")

>>> df2.show()
# +--------+-----------------+
# |agent_id|   approxQuantile|
# +--------+-----------------+
# |       a|8239.999999999998|
# |       b|7449.999999999998|
# +--------+-----------------+ 

Note 1: This solution was tested with Spark 1.6.2.

Note 2: approxQuantile is not available for pyspark in Spark < 2.0.

Note 3: percentile_approx returns an approximate pth percentile of a numeric column in the group, so the result may differ slightly from the exact quantile.
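Since the asker's end goal is filtering, here is a plain-Python sketch (not Spark) of the downstream step: compute each group's 0.95 cutoff, then keep only the rows above it. The function name is a hypothetical; `statistics.quantiles` with `method="inclusive"` linearly interpolates between order statistics, which is similar in spirit (though not in numerics) to Hive's percentile_approx.

```python
from collections import defaultdict
from statistics import quantiles

def filter_above_group_quantile(rows):
    """Keep only rows whose value exceeds their group's 0.95 quantile —
    the per-group equivalent of joining the quantile back and filtering."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # 99 cut points at 1%..99%; index 94 is the 95th percentile
    cutoff = {key: quantiles(vals, n=100, method="inclusive")[94]
              for key, vals in groups.items()}
    return [(key, value) for key, value in rows if value > cutoff[key]]

payments = [("a", 1000), ("b", 1100), ("a", 1100), ("a", 1200),
            ("b", 1200), ("b", 1250), ("a", 10000), ("b", 9000)]
filtered = filter_above_group_quantile(payments)
```

On the sample data only the two outlier payments (a's 10000 and b's 9000) survive the filter. In Spark the same effect is typically achieved by joining the per-group quantile back onto the original DataFrame and applying a `where` condition.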


Source: https://habr.com/ru/post/1655437/
