How to apply a function to each row of a specified PySpark DataFrame column

I have a PySpark DataFrame with three columns; its structure is shown below.

In[1]: df.take(1)    
Out[1]:
[Row(angle_est=-0.006815859163590619, rwsep_est=0.00019571401752467945, cost_est=34.33651951754235)]

What I want to do is take each value of the first column (angle_est) and pass it as the parameter xMisallignment to a specific function that sets a property of a class object. The function is defined as:

def setMisAllignment(self, xMisallignment):
    # requires: import numpy as np, import warnings
    if np.abs(xMisallignment) > 0.8:
        warnings.warn('You might set misallignment angle too large.')
    self.MisAllignment = xMisallignment

I tried selecting the first column, converting it to an RDD, and applying the function above inside map(), but it doesn't seem to work: MisAllignment hasn't changed.

df.select(df.angle_est).rdd.map(lambda row: model0.setMisAllignment(row))

In[2]: model0.MisAllignment
Out[2]: 0.00111511718224
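For context on why this fails: Spark transformations such as map() are lazy (nothing runs until an action is called), and their closures execute on executor processes against pickled copies of model0, so the driver's model0 is never mutated. A minimal sketch of a driver-side alternative, using a hypothetical stand-in Model class and a plain list in place of the values that df.select('angle_est').collect() would return:

```python
import warnings
import numpy as np

class Model:
    """Hypothetical stand-in for the asker's model class."""
    def __init__(self):
        self.MisAllignment = None

    def setMisAllignment(self, xMisallignment):
        if np.abs(xMisallignment) > 0.8:
            warnings.warn('You might set misallignment angle too large.')
        self.MisAllignment = xMisallignment

model0 = Model()

# With a real DataFrame, one would first pull the column to the driver:
#   angles = [row.angle_est for row in df.select('angle_est').collect()]
# Here a plain list stands in for the collected values.
angles = [-0.006815859163590619, 0.12, 0.05]

for angle in angles:
    model0.setMisAllignment(angle)  # runs on the driver, so the change sticks

print(model0.MisAllignment)
```

Note that collect() brings the whole column into driver memory, so this only fits when the column is reasonably small; mutating a driver-side object from inside an RDD map() will not work regardless.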

Does anyone have any ideas on how to make this work? Thanks in advance!


Source: https://habr.com/ru/post/1681579/
