Performance: UDF vs Spark SQL vs Column functions

I understand that a UDF is a complete black box for Spark, and no attempt will be made to optimize it. But will using the Column type and its functions, listed in the API docs (https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.Column), make the function "acceptable" to the Catalyst Optimizer?
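As a concrete illustration of the black-box point (my sketch, not from the original question): a predicate built from Column operations can be pushed down to a columnar source, while the same predicate inside a UDF cannot, because Catalyst cannot look inside the lambda. Assuming a DataFrame df with an integer column col1 and spark.implicits._ in scope:

import org.apache.spark.sql.functions.udf

val isPositive = udf( (n: Int) => n > 0 )
df.filter(isPositive($"col1")).explain()  // the predicate stays an opaque UDF call
df.filter($"col1" > 0).explain()          // e.g. a Parquet scan reports PushedFilters: [GreaterThan(col1,0)]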

For example, a UDF to create a new column by adding 1 to an existing column:

import org.apache.spark.sql.functions.udf

val addOne = udf( (num: Int) => num + 1 )
df.withColumn("col2", addOne($"col1"))

The same function using the Column type:

import org.apache.spark.sql.Column

def addOne(col1: Column) = col1.plus(1)
df.withColumn("col2", addOne($"col1"))

or

spark.sql("select *, col1 + 1 from df")

Will there be a difference in performance between these three approaches?
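One way to check this yourself, before timing anything, is to compare the physical plans. A minimal, self-contained sketch (a local session and a three-row DataFrame are my stand-ins for the real data):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("col1")

// Variant 1: the UDF stays opaque; the physical plan shows something like
//   Project [col1, UDF(col1) AS col2]
val addOne = udf( (num: Int) => num + 1 )
df.withColumn("col2", addOne($"col1")).explain()

// Variants 2 and 3: Column.plus(1) is the Java-friendly spelling of
// $"col1" + 1, and both compile to the same Add expression as the SQL text:
//   Project [col1, (col1 + 1) AS col2]
df.withColumn("col2", $"col1".plus(1)).explain()
df.createOrReplaceTempView("df")
spark.sql("select *, col1 + 1 from df").explain()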


I ran all three variants on the same data and compared the resulting stages in the Spark UI. The run times were very close, with the UDF variant only marginally slower (on the order of 0.7 s):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.udf

// Variant 1: Scala UDF
val addOne = udf( (num: Int) => num + 1 )
val res1 = df.withColumn("col2", addOne($"col1"))
res1.show()
//res1.explain()

// Variant 2: Column function
def addOne2(col1: Column) = col1.plus(1)
val res2 = df.withColumn("col2", addOne2($"col1"))
res2.show()
//res2.explain()

// Variant 3: Spark SQL over a temp view
df.createOrReplaceTempView("df")
val res3 = spark.sql("select *, col1 + 1 from df")
res3.show()
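Note that show() only materializes a handful of rows; to time the full computation you need an action that touches every row. Here is a hedged sketch of such a harness; the dataset size, the sum aggregate used to defeat column pruning, the col2 alias in the SQL, and the time helper are all assumptions of mine, not the original code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{sum, udf}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical test data; the original post's dataset did not survive.
val df = spark.range(10000000L).select($"id".cast("int").as("col1"))
df.createOrReplaceTempView("df")

// Crude wall-clock timer for a single action.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

// Aggregating over col2 forces the new column to be computed for every row,
// so Catalyst cannot simply prune the projection away.
val addOne = udf( (num: Int) => num + 1 )
time("udf")    { df.withColumn("col2", addOne($"col1")).agg(sum("col2")).collect() }
time("column") { df.withColumn("col2", $"col1" + 1).agg(sum("col2")).collect() }
time("sql")    { spark.sql("select *, col1 + 1 as col2 from df").agg(sum("col2")).collect() }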

[Spark UI timeline screenshot: the first two stages belong to the UDF variant, the next two to the Column variant, and the last two to the Spark SQL variant]

[Spark UI screenshot: executor computing time for the UDF stage (354.0 B)]



Source: https://habr.com/ru/post/1682968/

