How to use broadcast collection in udf?

Question

How to use broadcast collection in udf?

How to use the broadcast collection in Spark SQL 1.6.1 udf. UDP should be called from main SQL, as shown below.

sqlContext.sql("""Select col1,col2,udf_1(key) as value_from_udf FROM table_a""")

udf_1() must watch a small translation collection to return the value in the main sql.

+4

apache-spark apache-spark-sql

S. K Nov 18 '16 at 9:44

source share

1 answer

mtoto · Accepted Answer · 2016-11-18T11:42:49+0000

Here is a minimal reproducible example in pySpark, illustrating the use of broadcast variables to perform searches using a function lambdaas UDFinside an operator SQL.

# Create dummy data and register as table
df = sc.parallelize([
    (1,"a"),
    (2,"b"),
    (3,"c")]).toDF(["num","let"])
df.registerTempTable('table')

# Create broadcast variable from local dictionary
myDict = {1: "y", 2: "x", 3: "z"}
broadcastVar = sc.broadcast(myDict) 
# Alternatively, if your dict is a key-value rdd, 
# you can do sc.broadcast(rddDict.collectAsMap())

# Create lookup function and apply it
sqlContext.registerFunction("lookup", lambda x: broadcastVar.value.get(x))
sqlContext.sql('select num, let, lookup(num) as test from table').show()
+---+---+----+
|num|let|test|
+---+---+----+
|  1|  a|   y|
|  2|  b|   x|
|  3|  c|   z|
+---+---+----+

How to use broadcast collection in udf?

More articles: