I need to join two tables in PySpark, matching each row not on the exact value from the right table but on the closest value (since there is no exact match). This works fine in regular SQL, but it does not work in Spark SQL. I am using Spark 2.2.1.
In plain SQL:
SELECT a.*,
(SELECT b.field2 FROM tableB b
WHERE b.field1 = a.field1
ORDER BY ABS(b.field2 - a.field2) LIMIT 1) as field2
FROM tableA a
ORDER BY a.field1
This works fine.
In Spark SQL:
...
tableA_DF.registerTempTable("tableA")
tableB_DF.registerTempTable("tableB")
query = """SELECT a.*,
           (SELECT b.field2 FROM tableB b
            WHERE b.field1 = a.field1
            ORDER BY ABS(b.field2 - a.field2) LIMIT 1) field2
           FROM tableA a
           ORDER BY a.field1"""
result_DF = spark.sql(query)
I get the following exception:
pyspark.sql.utils.AnalysisException: u'Expressions that reference an external query are not supported outside of WHERE / HAVING clauses
If Spark 2.2.1 does not support correlated subqueries in the SELECT list, what will work instead?
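One workaround I am considering, and I would appreciate confirmation that it is the right direction, is to replace the correlated subquery with a plain join on field1 plus a window function that keeps, for each row of tableA, the tableB row with the smallest absolute difference in field2. Below is just a sketch built on the tableA_DF / tableB_DF from above; field2_closest is my own name for the output column, and I am assuming that (field1, field2) identifies a row of tableA:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

a = tableA_DF.alias("a")
b = tableB_DF.alias("b")

# Plain equi-join on field1; every candidate row of tableB is kept for now
joined = a.join(b, F.col("a.field1") == F.col("b.field1"), "left")

# Rank the tableB candidates per tableA row by how close field2 is.
# NOTE: assuming (field1, field2) identifies a row of tableA; if not,
# a proper unique key of tableA should go into partitionBy instead.
w = (Window
     .partitionBy(F.col("a.field1"), F.col("a.field2"))
     .orderBy(F.abs(F.col("b.field2") - F.col("a.field2"))))

result_DF = (joined
             .withColumn("rn", F.row_number().over(w))
             .where(F.col("rn") == 1)          # keep only the closest match
             .select("a.*", F.col("b.field2").alias("field2_closest"))
             .orderBy("field1"))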
Thanks in advance, Gary