Passing null columns as a parameter in Spark SQL UDF

Here is a Spark UDF that I use to compute a value from multiple columns.

def spark_udf_func(s: String, i:Int): Boolean = { 
    // I'm returning true regardless of the parameters passed to it.
    true
}

val spark_udf = org.apache.spark.sql.functions.udf(spark_udf_func _)

val df = sc.parallelize(Array[(Option[String], Option[Int])](
  (Some("Rafferty"), Some(31)), 
  (null, Some(33)), 
  (Some("Heisenberg"), Some(33)),  
  (Some("Williams"), null)
)).toDF("LastName", "DepartmentID")

df.withColumn("valid", spark_udf(df.col("LastName"), df.col("DepartmentID"))).show()

+----------+------------+-----+
|  LastName|DepartmentID|valid|
+----------+------------+-----+
|  Rafferty|          31| true|
|      null|          33| true|
|Heisenberg|          33| true|
|  Williams|        null| null|
+----------+------------+-----+

Can someone explain why the value of the valid column is null for the last row?

When I checked the Spark plan, I saw that it contains a condition: if the second column (DepartmentID) is null, the expression returns null instead of calling the UDF.

== Physical Plan ==

*Project [_1#699 AS LastName#702, _2#700 AS DepartmentID#703, if (isnull(_2#700)) null else UDF(_1#699, _2#700) AS valid#717]
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, unwrapoption(ObjectType(class java.lang.String), assertnotnull(input[0, scala.Tuple2, true])._1), true) AS _1#699, unwrapoption(IntegerType, assertnotnull(input[0, scala.Tuple2, true])._2) AS _2#700]
   +- Scan ExternalRDDScan[obj#698]

Why does Spark behave this way?
Why does it happen only for Integer columns?
What am I doing wrong here, and what is the correct way to handle nulls inside a UDF when a UDF parameter is null?


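The problem is that null is not a valid value for a Scala Int (the type backing the column), while it is a valid value for String. Int is equivalent to the Java int primitive and must always hold a value. This means the udf cannot be called when the value is null, and therefore the result stays null.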
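The difference between the two types can be seen in plain Scala, without Spark. This is only a minimal sketch; the object name IntNullability is made up for the illustration.

```scala
object IntNullability {
  def main(args: Array[String]): Unit = {
    // val broken: Int = null  // does not compile: Int is a primitive value type

    // A boxed java.lang.Integer is a reference type, so it may be null.
    val boxed: java.lang.Integer = null

    // Forcing null into an Int does not keep the null; it yields the default value.
    val coerced: Int = null.asInstanceOf[Int]

    println(boxed == null) // true
    println(coerced)       // 0
  }
}
```

Since a primitive Int has no way to represent "missing", skipping the call and returning null, as the if (isnull(...)) branch in the physical plan does, avoids silently feeding 0 into the function.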

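There are two ways to solve this:

  • Change the function to accept java.lang.Integer (which is an object and can be null)
  • If you cannot change the function, use when/otherwise to handle the null case separately, for example when(col("int col").isNull, someValue).otherwise(the original call)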
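The second option could look roughly like the sketch below. It assumes a local SparkSession and uses lit(false) as the replacement value for null rows; the object name NullSafeUdfCall, the helper withValid, and the choice of false are illustrative and not part of the original question. The sample data uses None where the question used null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, udf, when}

object NullSafeUdfCall {
  // Same shape as the UDF in the question: the Int parameter is a primitive.
  def sparkUdfFunc(s: String, i: Int): Boolean = true

  private val sparkUdf = udf(sparkUdfFunc _)

  // Hypothetical helper: handle the null case explicitly, otherwise call the UDF.
  def withValid(df: DataFrame): DataFrame =
    df.withColumn(
      "valid",
      when(col("DepartmentID").isNull, lit(false))
        .otherwise(sparkUdf(col("LastName"), col("DepartmentID")))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-safe-udf-call")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Option[String], Option[Int])](
      (Some("Rafferty"), Some(31)),
      (None, Some(33)),
      (Some("Williams"), None)
    ).toDF("LastName", "DepartmentID")

    withValid(df).show()
    spark.stop()
  }
}
```

With this wrapper the row whose DepartmentID is null should get the explicit replacement value rather than null, and the function itself stays unchanged.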


To accept null, declare the parameter as Integer (the Java type) instead of Scala Int:

def spark_udf_func(s: String, i: Integer): Boolean = {
    // Integer is a boxed reference type, so i can be null here; return false in that case.
    i != null
}
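For completeness, here is how the Integer version might be wired up end to end. This is a sketch under assumptions: it runs against a local SparkSession, and the object name BoxedIntegerUdf and the helper withValid are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object BoxedIntegerUdf {
  // java.lang.Integer is a reference type, so Spark can pass null to the function.
  def sparkUdfFunc(s: String, i: java.lang.Integer): Boolean = i != null

  private val sparkUdf = udf(sparkUdfFunc _)

  // Hypothetical helper that adds the "valid" column using the boxed-Integer UDF.
  def withValid(df: DataFrame): DataFrame =
    df.withColumn("valid", sparkUdf(col("LastName"), col("DepartmentID")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("boxed-integer-udf")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Option[String], Option[Int])](
      (Some("Rafferty"), Some(31)),
      (None, Some(33)),
      (Some("Williams"), None)
    ).toDF("LastName", "DepartmentID")

    withValid(df).show()
    spark.stop()
  }
}
```

Here the null check lives inside the function, so the decision about what a missing DepartmentID means is made in one place instead of at every call site.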

Source: https://habr.com/ru/post/1685076/

