Performance: UDF vs Spark SQL vs Column functions

I understand that a UDF is a complete black box for Spark, and no attempt will be made to optimize it. But will using the Column type and its functions, listed in the API docs (https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.Column), make the function "acceptable" to the Catalyst Optimizer?
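As a concrete illustration of the black-box point (my sketch, not from the original question): a predicate built from Column operations can be pushed down to a columnar source, while the same predicate inside a UDF cannot, because Catalyst cannot look inside the lambda. Assuming a DataFrame df with an integer column col1 and spark.implicits._ in scope:

import org.apache.spark.sql.functions.udf

val isPositive = udf( (n: Int) => n > 0 )
df.filter(isPositive($"col1")).explain()  // the predicate stays an opaque UDF call
df.filter($"col1" > 0).explain()          // e.g. a Parquet scan reports PushedFilters: [GreaterThan(col1,0)]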

For example, a UDF to create a new column by adding 1 to an existing column:

import org.apache.spark.sql.functions.udf

val addOne = udf( (num: Int) => num + 1 )
df.withColumn("col2", addOne($"col1"))

The same function using the Column type:

import org.apache.spark.sql.Column

def addOne(col1: Column) = col1.plus(1)
df.withColumn("col2", addOne($"col1"))

or

spark.sql("select *, col1 + 1 from df")

Will there be a difference in performance between these three approaches?
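One way to check this yourself, before timing anything, is to compare the physical plans. A minimal, self-contained sketch (a local session and a three-row DataFrame are my stand-ins for the real data):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("col1")

// Variant 1: the UDF stays opaque; the physical plan shows something like
//   Project [col1, UDF(col1) AS col2]
val addOne = udf( (num: Int) => num + 1 )
df.withColumn("col2", addOne($"col1")).explain()

// Variants 2 and 3: Column.plus(1) is the Java-friendly spelling of
// $"col1" + 1, and both compile to the same Add expression as the SQL text:
//   Project [col1, (col1 + 1) AS col2]
df.withColumn("col2", $"col1".plus(1)).explain()
df.createOrReplaceTempView("df")
spark.sql("select *, col1 + 1 from df").explain()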


I ran all three variants on the same data and compared the resulting stages in the Spark UI. The run times were very close, with the UDF variant only marginally slower (on the order of 0.7 s):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.udf

// Variant 1: Scala UDF
val addOne = udf( (num: Int) => num + 1 )
val res1 = df.withColumn("col2", addOne($"col1"))
res1.show()
//res1.explain()

// Variant 2: Column function
def addOne2(col1: Column) = col1.plus(1)
val res2 = df.withColumn("col2", addOne2($"col1"))
res2.show()
//res2.explain()

// Variant 3: Spark SQL over a temp view
df.createOrReplaceTempView("df")
val res3 = spark.sql("select *, col1 + 1 from df")
res3.show()
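Note that show() only materializes a handful of rows; to time the full computation you need an action that touches every row. Here is a hedged sketch of such a harness; the dataset size, the sum aggregate used to defeat column pruning, the col2 alias in the SQL, and the time helper are all assumptions of mine, not the original code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{sum, udf}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical test data; the original post's dataset did not survive.
val df = spark.range(10000000L).select($"id".cast("int").as("col1"))
df.createOrReplaceTempView("df")

// Crude wall-clock timer for a single action.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

// Aggregating over col2 forces the new column to be computed for every row,
// so Catalyst cannot simply prune the projection away.
val addOne = udf( (num: Int) => num + 1 )
time("udf")    { df.withColumn("col2", addOne($"col1")).agg(sum("col2")).collect() }
time("column") { df.withColumn("col2", $"col1" + 1).agg(sum("col2")).collect() }
time("sql")    { spark.sql("select *, col1 + 1 as col2 from df").agg(sum("col2")).collect() }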

[Spark UI timeline screenshot: the first two stages belong to the UDF variant, the next two to the Column variant, and the last two to the Spark SQL variant]

[Spark UI screenshot: executor computing time for the UDF stage (354.0 B)]



Source: https://habr.com/ru/post/1682968/

