Let's say I have two PySpark DataFrames, df1 and df2:
```
df1:        df2:
 'a'         'b'
  1           3
  2           6
  5
```
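For reference, the two example DataFrames can be created like this (a minimal sketch; it assumes a standard SparkSession named spark, which is not part of my original setup):

```python
from pyspark.sql import SparkSession

# Hypothetical session, just to make the example self-contained
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1,), (2,), (5,)], ["a"])
df2 = spark.createDataFrame([(3,), (6,)], ["b"])
```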
And I want to find the closest value in df2['b'] for each value in df1['a'], and add these closest values as a new column to df1.
In other words, for each value x in df1['a'], I want to find the y in df2['b'] that achieves min(abs(x - y)) (note: it can be assumed that there is only one y that reaches the minimum distance), and the result would be
```
'a' 'b'
 1   3
 2   3
 5   6
```
I tried the following code to first build a distance matrix (before finding the values that reach the minimum distance):
```python
from pyspark.sql import SQLContext
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

# UDF that computes the absolute distance between two column values
def dist(x, y):
    return abs(x - y)

udf_dict = udf(dist, IntegerType())
sql_sc = SQLContext(sc)

udf_dict(df1.a, df2.b)
```
which gives

```
Column<PythonUDF
```
Then I tried

```python
sql_sc.createDataFrame(udf_dict(df1.a, df2.b))
```

which runs forever without producing any error or output.
My questions:
- As I am new to Spark, is my way of building the output DataFrame efficient? (My approach would create a distance matrix for all values of a and b, and then find the minimum for each; see the sketch after these questions.)
- What happens in the last line of my code, and how can I fix it?
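To make the distance-matrix idea concrete, here is an untested sketch of what I have in mind (it assumes Spark 2.x, where crossJoin is available, and the df1/df2 from the example; the window step is just one way to pick the closest b):

```python
from pyspark.sql import functions as F
from pyspark.sql import Window

# Pair every 'a' with every 'b' and compute the distance of each pair
pairs = df1.crossJoin(df2).withColumn("dist", F.abs(F.col("a") - F.col("b")))

# For each 'a', keep only the row with the smallest distance
w = Window.partitionBy("a").orderBy("dist")
closest = (pairs
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .select("a", "b"))
```

If this is right, closest.show() should reproduce the expected table above, but I am unsure whether the full cross join is the efficient way to do it.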