PySpark: expressions that reference an outer query are not supported outside of WHERE

I need to join two tables in PySpark, matching not on the exact value from the right table but on the closest value (since there is no exact match).

It works fine in regular SQL, but not in SparkSQL. I am using Spark 2.2.1.

In plain SQL:

SELECT a.*,
(SELECT b.field2 FROM tableB b 
WHERE b.field1 = a.field1 
ORDER BY ABS(b.field2 - a.field2) LIMIT 1) as field2
FROM tableA a
ORDER BY a.field1

Works fine

In SparkSQL:

...
tableA_DF.registerTempTable("tableA")
tableB_DF.registerTempTable("tableB")

query = "SELECT a.*, \
(SELECT b.field2 FROM tableB b \
WHERE b.field1 = a.field1 \
ORDER BY ABS(b.field2 - a.field2) LIMIT 1) field2 \
FROM tableA a \
ORDER BY a.field1"

result_DF = spark.sql(query) 

I have the following exception:

pyspark.sql.utils.AnalysisException: u'Expressions that reference an external query are not supported outside of WHERE / HAVING clauses

If Spark 2.2.1 does not support it, what will work?

Thanks in advance, Gary

1 answer

Since you want the field2 value that is closest to a.field2, you can rewrite the correlated subquery as a join with a window function. For example:

...
tableA_DF.registerTempTable("tableA")
tableB_DF.registerTempTable("tableB")

query = "SELECT a.*, \
FIRST(b.field2) OVER (ORDER BY ABS(b.field2 - a.field2)) field2 \
FROM tableA a \
JOIN tableB b \
ON a.field1 = b.field1 \
ORDER BY a.field1"

result_DF = spark.sql(query) 
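
One thing to watch with the window above: without a PARTITION BY, the ranking is global rather than per row of tableA. Below is a minimal sketch of a per-row variant, keeping the field1/field2 names from the question; the synthetic key a_id and the helper names tableA_keyed, closest_field2 and rn are only illustrative, not anything from the original post.

from pyspark.sql import functions as F

# Synthetic key so the window can rank candidate matches per row of tableA
a_keyed = tableA_DF.withColumn("a_id", F.monotonically_increasing_id())
a_keyed.createOrReplaceTempView("tableA_keyed")
tableB_DF.createOrReplaceTempView("tableB")

query = """
SELECT field1, field2, closest_field2
FROM (
    SELECT a.a_id, a.field1, a.field2,
           b.field2 AS closest_field2,
           ROW_NUMBER() OVER (PARTITION BY a.a_id
                              ORDER BY ABS(b.field2 - a.field2)) AS rn
    FROM tableA_keyed a
    JOIN tableB b ON a.field1 = b.field1
) ranked
WHERE rn = 1
ORDER BY field1
"""

result_DF = spark.sql(query)

Partitioning by the synthetic key keeps the closest-match ranking per original row of tableA, even when several rows share the same field1.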

The correlated subquery in the question fails because of a limitation of the Catalyst optimizer. As far as I can tell, Spark 2.3.1 still behaves the same way.

Support for correlated subqueries outside of WHERE/HAVING (for example in the SELECT list or JOIN conditions) is being tracked and may land around Spark 2.4: https://issues.apache.org/jira/browse/SPARK-18455

Update: as of 9/11/18, SPARK-18455 has been retargeted to version 3.0.0. It does not look like it will be backported to a 2.x release, so if you are on 2.x you will either have to rewrite the query along the lines above or wait for a new major Spark release. For now, this kind of correlated subquery is simply not supported in Spark.
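
If you do have to stay on a 2.x release, the same closest-match idea can also be expressed entirely with the DataFrame API, which sidesteps the correlated subquery. This is again only a sketch under the same assumptions (field1/field2 columns as in the question, a hypothetical a_id key):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Synthetic key so the window ranks matches per original row of tableA
a_keyed = tableA_DF.withColumn("a_id", F.monotonically_increasing_id())

joined = (a_keyed.alias("a")
          .join(tableB_DF.alias("b"), F.col("a.field1") == F.col("b.field1")))

# Rank tableB candidates by how close their field2 is to tableA's field2
w = Window.partitionBy(F.col("a.a_id")).orderBy(F.abs(F.col("b.field2") - F.col("a.field2")))

result_DF = (joined
             .withColumn("rn", F.row_number().over(w))
             .where(F.col("rn") == 1)
             .select(F.col("a.field1"),
                     F.col("a.field2"),
                     F.col("b.field2").alias("closest_field2"))
             .orderBy(F.col("a.field1")))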

