Filtering base Scala collections containing null values has the following (and rather intuitive) behavior:
scala> List("a", "b", null).filter(_ != "a")
res0: List[String] = List(b, null)
However, I was very surprised to discover that the following filter removes nulls from a Spark DataFrame:
scala> val df = List(("a", null), ("c", "d")).toDF("A", "B")
scala> df.show
+---+----+
| A| B|
+---+----+
| a|null|
| c| d|
+---+----+
scala> df.filter('B =!= "d").show
+---+---+
| A| B|
+---+---+
+---+---+
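As far as I can tell, this happens because the comparison itself evaluates to null (not false) for the null row, and filter only keeps rows where the predicate is true. A minimal check on the same df:
scala> df.select('B =!= "d").show   // assuming SQL three-valued logic: null for the null row, false for the "d" row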
If I want to keep the null values, I have to write
scala> df.filter('B =!= "d" || 'B.isNull).show
+---+----+
| A| B|
+---+----+
| a|null|
+---+----+
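If I understand Spark's null-safe equality operator <=> correctly, negating it should be an equivalent way to keep the null rows:
scala> df.filter(!('B <=> "d")).show   // <=> treats null as an ordinary value, so null <=> "d" is false and the negation is true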
Personally, I believe that dropping nulls by default is very error-prone. Why this choice? And why isn't it indicated in the API documentation? Am I missing something?