Pyspark Dataframe Apply function to two columns

Let's say I have two PySpark df1 and df2 data df2 .

 df1= 'a' 1 2 5 df2= 'b' 3 6 

And I want to find the closest df2['b'] value for each df1['a'] and add the closest values ​​as a new column in df1 .

In other words, for each value of x in df1['a'] I want to find y that reaches min(abx(xy)) for all y in df2['b'] (note: it can be assumed that there is only one y , which can reach a minimum distance), and the result will be

 'a' 'b' 1 3 2 3 5 6 

I tried the following code to first create a distance matrix (before finding values ​​that reach the minimum distance):

 from pyspark.sql.types import IntegerType from pyspark.sql.functions import udf def dict(x,y): return abs(xy) udf_dict = udf(dict, IntegerType()) sql_sc = SQLContext(sc) udf_dict(df1.a, df2.b) 

which gives

 Column<PythonUDF#dist(a,b)> 

Then i tried

 sql_sc.CreateDataFrame(udf_dict(df1.a, df2.b)) 

which works forever without providing error / output.

My questions:

  • As I am new to Spark, is my way to build an output DataFrame efficient? (My path would create a distance matrix for all values ​​of a and b , and then find min one)
  • What happened to the last line of my code and how to fix it?
+5
source share
1 answer

Starting from the second question - you can apply udf only to the existing data framework, I think you were thinking of something like this:

 >>> df1.join(df2).withColumn('distance', udf_dict(df1.a, df2.b)).show() +---+---+--------+ | a| b|distance| +---+---+--------+ | 1| 3| 2| | 1| 6| 5| | 2| 3| 1| | 2| 6| 4| | 5| 3| 2| | 5| 6| 1| +---+---+--------+ 

But there is a more efficient way to apply this distance using internal abs :

 >>> from pyspark.sql.functions import abs >>> df1.join(df2).withColumn('distance', abs(df1.a -df2.b)) 

Then you can find the corresponding numbers by calculating:

 >>> distances = df1.join(df2).withColumn('distance', abs(df1.a -df2.b)) >>> min_distances = distances.groupBy('a').agg(min('distance').alias('distance')) >>> distances.join(min_distances, ['a', 'distance']).select('a', 'b').show() +---+---+ | a| b| +---+---+ | 5| 6| | 1| 3| | 2| 3| +---+---+ 
+5
source

Source: https://habr.com/ru/post/1259182/


All Articles