I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or an exact result would be fine. I prefer a solution that I can use within a groupBy / agg context, so that I can mix it with other PySpark aggregate functions. If for some reason that is not possible, a different approach would also do.
This question is related, but does not show how to use approxQuantile as an aggregate function.
I also have access to the Hive UDAF percentile_approx, but I don't know how to use it as an aggregate function either.
Suppose I have the following data frame:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    ['A', 1],
    ['A', 2],
    ['A', 3],
    ['B', 4],
    ['B', 5],
    ['B', 6],
], ('grp', 'val'))
# magic_percentile is a placeholder for the aggregate function I am looking for
df_grp = df.groupBy('grp').agg(f.magic_percentile('val', 0.5).alias('med_val'))
df_grp.show()
Expected Result:
+----+-------+
| grp|med_val|
+----+-------+
| A| 2|
| B| 5|
+----+-------+