Why does filter drop null values by default in Spark DataFrames?

When filtering base Scala collections that contain null values, filter has the following (and rather intuitive) behavior:

scala> List("a", "b", null).filter(_ != "a")
res0: List[String] = List(b, null)

However, I was very surprised to learn that the same kind of filter removes null values in a Spark DataFrame:

scala> val df = List(("a", null), ( "c", "d")).toDF("A", "B")
scala> df.show
+---+----+
|  A|   B|
+---+----+
|  a|null|
|  c|   d|
+---+----+
scala> df.filter('B =!= "d").show
+---+---+
|  A|  B|
+---+---+
+---+---+

If I want to keep the null values, I have to add

df.filter('B =!= "d" || 'B.isNull).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+

Personally, I find dropping nulls by default very error prone. Why this choice? And why is it not indicated in the API documentation? Am I missing something?

+4
1 answer

This is because of how the SQL standard treats NULL: a comparison with NULL evaluates to NULL (unknown) rather than true or false, and filter only keeps rows for which the predicate is true. Spark SQL follows the standard here, unlike plain Scala.
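
You can see this three-valued logic directly by projecting the comparison as its own column. This is just a minimal sketch against the same df (the alias B_neq_d is an arbitrary name I chose); the comparison against the null cell surfaces as null, which filter then treats as false:

scala> df.select('B, ('B =!= "d").as("B_neq_d")).show
+----+-------+
|   B|B_neq_d|
+----+-------+
|null|   null|
|   d|  false|
+----+-------+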

Spark DataFrames do have null-safe equality (<=>), though:

scala> df.filter($"B" <=> null).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+


scala> import org.apache.spark.sql.functions.not
import org.apache.spark.sql.functions.not

scala> df.filter(not($"B" <=> "d")).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+
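
If you prefer passing a SQL expression string, the same null-safe predicate can also be written that way (again a sketch against the same df, using Spark SQL's <=> operator):

scala> df.filter("NOT (B <=> 'd')").show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+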


+1

Source: https://habr.com/ru/post/1694482/

