Difference between === null and isNull in Spark DataFrame

I'm a little confused about the difference between

 df.filter(col("c1") === null)

and

 df.filter(col("c1").isNull)

on the same DataFrame: I get counts with === null, but zero counts with isNull. Please help me understand the difference. Thanks.

2 answers

First of all, do not use null in your Scala code unless you really have to, for compatibility reasons.
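If a value may genuinely be missing, the idiomatic alternative is Option; a minimal sketch (lookupNickname is a hypothetical null-returning API):

 // Option.apply turns a possibly-null reference into Some(...) or None,
 // so later code never compares against null directly.
 def lookupNickname(id: Int): String = if (id == 1) "neo" else null // hypothetical
 val nickname: Option[String] = Option(lookupNickname(2))
 val display: String = nickname.getOrElse("<missing>") // "<missing>"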

As for your question, this is plain SQL. col("c1") === null is interpreted as c1 = NULL, and because NULL marks undefined values, the result is undefined for any value, including NULL itself.

 spark.sql("SELECT NULL = NULL").show 
 +-------------+ |(NULL = NULL)| +-------------+ | null| +-------------+ 
 spark.sql("SELECT NULL != NULL").show 
 +-------------------+ |(NOT (NULL = NULL))| +-------------------+ | null| +-------------------+ 
 spark.sql("SELECT TRUE != NULL").show 
 +------------------------------------+ |(NOT (true = CAST(NULL AS BOOLEAN)))| +------------------------------------+ | null| +------------------------------------+ 
 spark.sql("SELECT TRUE = NULL").show 
 +------------------------------+ |(true = CAST(NULL AS BOOLEAN))| +------------------------------+ | null| +------------------------------+ 

The only valid methods to check for NULL are:

  • IS NULL:

     spark.sql("SELECT NULL IS NULL").show
     +--------------+
     |(NULL IS NULL)|
     +--------------+
     |          true|
     +--------------+

     spark.sql("SELECT TRUE IS NULL").show
     +--------------+
     |(true IS NULL)|
     +--------------+
     |         false|
     +--------------+
  • IS NOT NULL:

     spark.sql("SELECT NULL IS NOT NULL").show
     +------------------+
     |(NULL IS NOT NULL)|
     +------------------+
     |             false|
     +------------------+

     spark.sql("SELECT TRUE IS NOT NULL").show
     +------------------+
     |(true IS NOT NULL)|
     +------------------+
     |              true|
     +------------------+

These are implemented in the DataFrame DSL as Column.isNull and Column.isNotNull, respectively.
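For example, a minimal DSL sketch (assuming an active SparkSession named spark; the column name c1 and sample data are illustrative):

 import spark.implicits._

 // An Option[String] column encodes to a nullable string column.
 val df = Seq(Some("a"), None).toDF("c1")

 df.filter($"c1".isNull).count()    // 1 -- keeps only the NULL row
 df.filter($"c1".isNotNull).count() // 1 -- keeps only the non-NULL row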

Note:

For NULL-safe comparisons, use IS DISTINCT FROM / IS NOT DISTINCT FROM:

 spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show 
 +---------------+ |(NULL <=> NULL)| +---------------+ | true| +---------------+ 
 spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show 
 +--------------------------------+ |(CAST(NULL AS BOOLEAN) <=> true)| +--------------------------------+ | false| +--------------------------------+ 

or not(_ <=> _) / <=>:

 spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show 
 +---------------+ |(col1 <=> col2)| +---------------+ | true| +---------------+ 
 spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show 
 +---------------+ |(col1 <=> col2)| +---------------+ | false| +---------------+ 

in SQL and the DataFrame DSL, respectively.
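The not(_ <=> _) form can be sketched the same way (assuming the same session and implicits; functions.not is the standard negation helper):

 import org.apache.spark.sql.functions.not

 spark.sql("SELECT NULL AS col1, TRUE AS col2")
   .select(not($"col1" <=> $"col2"))
   .show() // true -- NULL and TRUE are distinct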

Related:

Including null values in an Apache Spark Join


Usually, the best way to shed light on unexpected results in Spark DataFrames is to look at the explain plan. Consider the following example:

 import org.apache.spark.sql.{DataFrame, SparkSession}
 import org.apache.spark.sql.functions._

 object Example extends App {

   val session = SparkSession.builder().master("local[*]").getOrCreate()

   case class Record(c1: String, c2: String)

   val data = List(Record("a", "b"), Record(null, "c"))
   val rdd = session.sparkContext.parallelize(data)

   import session.implicits._

   val df: DataFrame = rdd.toDF

   val filtered = df.filter(col("c1") === null)
   println(filtered.count()) // <-- outputs 0, not expected

   val filtered2 = df.filter(col("c1").isNull)
   println(filtered2.count()) // <-- outputs 1, as expected

   filtered.explain(true)
   filtered2.explain(true)
 }

The first explain plan shows:

 == Physical Plan ==
 *Filter (isnotnull(c1#2) && null)
 +- Scan ExistingRDD[c1#2,c2#3]

This filter clause looks nonsensical: the && null ensures that it can never resolve to true.
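That follows from SQL three-valued logic: x AND NULL is NULL when x is true and false when x is false, and a filter keeps a row only when its predicate evaluates to exactly true. A quick check (assuming the same session):

 // Neither value is true, so no row can ever pass such a filter.
 spark.sql("SELECT true AND NULL AS t, false AND NULL AS f").show()
 // t is null, f is false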

The second explain plan is:

 == Parsed Logical Plan ==
 'Filter isnull('c1)
 +- LogicalRDD [c1#2, c2#3]
 ...
 == Physical Plan ==
 *Filter isnull(c1#2)
 +- Scan ExistingRDD[c1#2,c2#3]

Here the filter is what you would expect and want.
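As a footnote, if you really want an equality-style expression that matches the NULL rows, the null-safe operator from the first answer also works here; a sketch against the same df (assuming the example above has run):

 // <=> treats NULL = NULL as true, unlike ===
 val filtered3 = df.filter(col("c1") <=> null)
 println(filtered3.count()) // 1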

