Why does filter drop null values by default in Spark DataFrames?

When filtering base Scala collections that contain null values, filter has the following (and rather intuitive) behavior:

scala> List("a", "b", null).filter(_ != "a")
res0: List[String] = List(b, null)

However, I was very surprised to learn that the same kind of filter removes null values in a Spark DataFrame:

scala> val df = List(("a", null), ( "c", "d")).toDF("A", "B")
scala> df.show
+---+----+
|  A|   B|
+---+----+
|  a|null|
|  c|   d|
+---+----+
scala> df.filter('B =!= "d").show
+---+---+
|  A|  B|
+---+---+
+---+---+

If I want to keep the null values, I have to add

df.filter('B =!= "d" || 'B.isNull).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+

Personally, I find dropping nulls by default very error prone. Why this choice? And why is it not indicated in the API documentation? Am I missing something?

+4
1 answer

This is because of how the SQL standard treats NULL: a comparison with NULL evaluates to NULL (unknown) rather than true or false, and filter only keeps rows for which the predicate is true. Spark SQL follows the standard here, unlike plain Scala.
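
You can see this three-valued logic directly by projecting the comparison as its own column. This is just a minimal sketch against the same df (the alias B_neq_d is an arbitrary name I chose); the comparison against the null cell surfaces as null, which filter then treats as false:

scala> df.select('B, ('B =!= "d").as("B_neq_d")).show
+----+-------+
|   B|B_neq_d|
+----+-------+
|null|   null|
|   d|  false|
+----+-------+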

Spark DataFrames do have null-safe equality (<=>), though:

scala> df.filter($"B" <=> null).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+


scala> import org.apache.spark.sql.functions.not
import org.apache.spark.sql.functions.not

scala> df.filter(not($"B" <=> "d")).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+
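
If you prefer passing a SQL expression string, the same null-safe predicate can also be written that way (again a sketch against the same df, using Spark SQL's <=> operator):

scala> df.filter("NOT (B <=> 'd')").show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+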


+1

Source: https://habr.com/ru/post/1694482/

