Pyspark approxQuantile function

Question

I have a dataframe with the columns id, price, and timestamp.

I would like to find the median price for each 'id'.

I used the code below to compute it, but it fails with the error shown at the end of this question.

from pyspark.sql import DataFrameStatFunctions as statFunc
windowSpec = Window.partitionBy("id")
median = statFunc.approxQuantile("price",
                                 [0.5],
                                 0) \
                 .over(windowSpec)

return df.withColumn("Median", median)

Is it not possible to use DataFrameStatFunctions to populate values in a new column?

TypeError: unbound method approxQuantile() must be called with DataFrameStatFunctions instance as first argument (got str instance instead)

apache-spark pyspark apache-spark-sql spark-dataframe pyspark-sql

BK C. Jul 24 '17 at 18:43

1 answer

desertnaut · Accepted Answer · 2017-08-04T11:58:56+0000

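Well, indeed, it is not possible to use approxQuantile to fill values in a new dataframe column, but that is not why you are getting this error. Unfortunately, the whole story underneath is rather frustrating, as is the case with many Spark (especially PySpark) features and their lack of adequate documentation.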

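To begin with, there is not one but two approxQuantile methods; the first is part of the standard DataFrame class, i.e. you do not need to import DataFrameStatFunctions: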

spark.version
# u'2.1.1'

sampleData = [("bob","Developer",125000),("mark","Developer",108000),("carl","Tester",70000),("peter","Developer",185000),("jon","Tester",65000),("roman","Tester",82000),("simon","Developer",98000),("eric","Developer",144000),("carlos","Tester",75000),("henry","Developer",110000)]

df = spark.createDataFrame(sampleData, schema=["Name","Role","Salary"])
df.show()
# +------+---------+------+ 
# |  Name|     Role|Salary|
# +------+---------+------+
# |   bob|Developer|125000| 
# |  mark|Developer|108000|
# |  carl|   Tester| 70000|
# | peter|Developer|185000|
# |   jon|   Tester| 65000|
# | roman|   Tester| 82000|
# | simon|Developer| 98000|
# |  eric|Developer|144000|
# |carlos|   Tester| 75000|
# | henry|Developer|110000|
# +------+---------+------+

med = df.approxQuantile("Salary", [0.5], 0.25) # no need to import DataFrameStatFunctions
med
# [98000.0]

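The second is part of DataFrameStatFunctions, but if you call it the way you do, you get the error you report, because it has to be called on an instance created from the dataframe: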

from pyspark.sql import DataFrameStatFunctions as statFunc
med2 = statFunc.approxQuantile( "Salary", [0.5], 0.25)
# TypeError: unbound method approxQuantile() must be called with DataFrameStatFunctions instance as first argument (got str instance instead)

# correct usage: wrap the dataframe in a DataFrameStatFunctions instance first
med2 = statFunc(df).approxQuantile( "Salary", [0.5], 0.25)
med2
# [82000.0]
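The TypeError above is a general Python behaviour rather than anything specific to Spark, and it can be reproduced without a cluster. The sketch below uses a hypothetical stand-in class, StatFunctions, with a hypothetical method middle_value; neither is part of PySpark. Calling the method on the class passes the string where the instance is expected, while calling it on an instance works. The "unbound method" wording in the original error comes from Python 2; under Python 3 the same mistake surfaces as a TypeError about a missing positional argument.

```python
class StatFunctions:
    """Hypothetical stand-in for DataFrameStatFunctions: it wraps a dataset."""

    def __init__(self, data):
        self.data = data

    def middle_value(self, key):
        # Return the middle element of the sorted values stored under `key`.
        ordered = sorted(row[key] for row in self.data)
        return ordered[len(ordered) // 2]


rows = [{"Salary": 70000}, {"Salary": 65000}, {"Salary": 82000}]

# Wrong: called on the class, so "Salary" is taken as `self` and `key` is missing.
try:
    StatFunctions.middle_value("Salary")
    error = None
except TypeError as exc:
    error = exc

# Right: create an instance from the data, then call the method on it.
result = StatFunctions(rows).middle_value("Salary")

print(type(error).__name__)
print(result)
```

This mirrors the difference between statFunc.approxQuantile(...) and statFunc(df).approxQuantile(...) in the lines above.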

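You will not find a simple example of this in the PySpark documentation (it took me some time to work it out myself)... And the best part? The two values are not equal: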

med == med2
# False

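I suspect this is due to the approximate nature of the algorithm (after all, it is supposed to return an approximate median), and even if you re-run the commands with the same toy data, you may get different values (and different from the ones I report here). I suggest experimenting a little to get a feel for it...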
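One way to see why both results above can be legitimate: the approxQuantile documentation states that, for N rows, probability p, and relative error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The plain-Python sketch below lists the salaries that meet this bound for p = 0.5 and err = 0.25. The helper name acceptable_values is hypothetical, and treating the rank as 1-based is an assumption.

```python
import math

# Salaries from the sample data used in this answer.
salaries = [125000, 108000, 70000, 185000, 65000,
            82000, 98000, 144000, 75000, 110000]


def acceptable_values(values, p, err):
    """Values whose rank (assumed 1-based) satisfies the documented bound."""
    ordered = sorted(values)
    n = len(ordered)
    lo = max(math.floor((p - err) * n), 1)   # lowest admissible rank
    hi = min(math.ceil((p + err) * n), n)    # highest admissible rank
    return ordered[lo - 1:hi]


candidates = acceptable_values(salaries, p=0.5, err=0.25)
print(candidates)
print(98000 in candidates, 82000 in candidates)
```

With an error tolerance this loose, several salaries qualify as an "approximate median", including both values reported above. A smaller relative error narrows the admissible range.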

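But, as I said, this is not why you cannot use approxQuantile to fill values in a new dataframe column; even with the correct syntax, you get a different error: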

df2 = df.withColumn('median_salary', statFunc(df).approxQuantile( "Salary", [0.5], 0.25))
# AssertionError: col should be Column

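Here, col refers to the second argument of withColumn, i.e. the approxQuantile call, and the error message says that it is not of type Column; indeed, it is a list: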

type(statFunc(df).approxQuantile( "Salary", [0.5], 0.25))
# list

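So, when filling column values, Spark expects arguments of type Column, and you cannot use lists; here is an example that creates a new column with the mean salary per Role instead of the median: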

import pyspark.sql.functions as func
from pyspark.sql import Window

windowSpec = Window.partitionBy(df['Role'])
df2 = df.withColumn('mean_salary', func.mean(df['Salary']).over(windowSpec))
df2.show()
# +------+---------+------+------------------+
# |  Name|     Role|Salary|       mean_salary| 
# +------+---------+------+------------------+
# |  carl|   Tester| 70000|           73000.0| 
# |   jon|   Tester| 65000|           73000.0|
# | roman|   Tester| 82000|           73000.0|
# |carlos|   Tester| 75000|           73000.0|
# |   bob|Developer|125000|128333.33333333333|
# |  mark|Developer|108000|128333.33333333333| 
# | peter|Developer|185000|128333.33333333333| 
# | simon|Developer| 98000|128333.33333333333| 
# |  eric|Developer|144000|128333.33333333333|
# | henry|Developer|110000|128333.33333333333| 
# +------+---------+------+------------------+

This works because, unlike approxQuantile, mean returns a Column:

type(func.mean(df['Salary']).over(windowSpec))
# pyspark.sql.column.Column
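For a dataset this small, the exact per-Role medians can be computed in plain Python and used as reference values when checking any grouped-median approach in Spark. This is only a sanity-check sketch on the sample data, not a Spark solution; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import median

# (Name, Role, Salary) rows from the sample data used in this answer.
sample_data = [
    ("bob", "Developer", 125000), ("mark", "Developer", 108000),
    ("carl", "Tester", 70000), ("peter", "Developer", 185000),
    ("jon", "Tester", 65000), ("roman", "Tester", 82000),
    ("simon", "Developer", 98000), ("eric", "Developer", 144000),
    ("carlos", "Tester", 75000), ("henry", "Developer", 110000),
]

# Collect the salaries of each role.
salaries_by_role = defaultdict(list)
for _name, role, salary in sample_data:
    salaries_by_role[role].append(salary)

# Exact median per role; with an even count this is the mean of the two middle values.
median_by_role = {role: median(vals) for role, vals in salaries_by_role.items()}
print(median_by_role)
```

Within Spark, a per-group median needs an expression that returns a Column, as mean does; the SQL aggregate percentile_approx is one candidate, if your Spark version provides it, and its result could be compared against these values.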
