Comparison operator in PySpark (not equal / !=)

I am trying to get all the rows in a dataframe where two flags are set to "1", and then all the rows where only one of the two flags is set to "1" and the other is NOT equal to "1".

With the following schema (three columns),

df = sqlContext.createDataFrame(
    [('a', 1, 'null'), ('b', 1, 1), ('c', 1, 'null'),
     ('d', 'null', 1), ('e', 1, 1)],  # ,('f',1,'NaN'),('g','bla',1)],
    schema=('id', 'foo', 'bar')
)

I get the following dataframe:

+---+----+----+
| id| foo| bar|
+---+----+----+
|  a|   1|null|
|  b|   1|   1|
|  c|   1|null|
|  d|null|   1|
|  e|   1|   1|
+---+----+----+

When I apply the filters, the first one (foo = 1 AND bar = 1) works, but the second (foo = 1 AND NOT bar = 1) does not.

foobar_df = df.filter( (df.foo==1) & (df.bar==1) )

gives:

+---+---+---+
| id|foo|bar|
+---+---+---+
|  b|  1|  1|
|  e|  1|  1|
+---+---+---+

Here is the filter that returns nothing:

foo_df = df.filter( (df.foo==1) & (df.bar!=1) )
foo_df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
+---+---+---+

Why doesn't it filter? How can I get the rows where only foo is '1'?


Why doesn't it filter? Because this is SQL, and NULL indicates a missing value. Any comparison to NULL, other than IS NULL and IS NOT NULL, is undefined. You need either:

col("bar").isNull() | (col("bar") != 1)

or

coalesce(col("bar") != 1, lit(True))

or, if you want a null-safe comparison (PySpark >= 2.3):

~col("bar").eqNullSafe(1)

Also note that 'null' is not a valid way to introduce a NULL literal. You should use None to indicate a missing value:
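The same three-valued logic can be reproduced in plain SQL with the standard-library sqlite3 module; this is a self-contained sketch using the question's data (with real NULLs via None), showing why the naive filter silently drops the NULL rows:

```python
import sqlite3

# Standard SQL three-valued logic: NULL != 1 evaluates to NULL,
# and a WHERE clause keeps a row only when the predicate is TRUE,
# so NULL rows are silently dropped.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (id TEXT, foo INTEGER, bar INTEGER)")
con.executemany("INSERT INTO df VALUES (?, ?, ?)", [
    ('a', 1, None), ('b', 1, 1), ('c', 1, None),
    ('d', None, 1), ('e', 1, 1),
])

# Naive filter: drops the NULL rows, just like df.bar != 1 in PySpark
naive = con.execute(
    "SELECT id FROM df WHERE foo = 1 AND bar != 1").fetchall()

# NULL-aware filter: explicitly admits NULLs
fixed = con.execute(
    "SELECT id FROM df WHERE foo = 1 AND (bar IS NULL OR bar != 1)").fetchall()

print(naive)  # []
print(fixed)  # [('a',), ('c',)]
```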

from pyspark.sql.functions import col, coalesce, lit

df = spark.createDataFrame([
    ('a', 1, 1), ('a',1, None), ('b', 1, 1),
    ('c' ,1, None), ('d', None, 1),('e', 1, 1)
]).toDF('id', 'foo', 'bar')

df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+

df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+
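For PySpark >= 2.3, `~col("bar").eqNullSafe(1)` expresses the same filter without an explicit null check. Its null-safe semantics (SQL's IS NOT DISTINCT FROM) can be sketched in plain Python; `eq_null_safe` here is a hypothetical helper for illustration, not part of any API:

```python
def eq_null_safe(a, b):
    # Null-safe equality: NULL == NULL is True, NULL == value is False.
    # Unlike a plain SQL comparison, the result is never NULL/unknown.
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

rows = [('a', 1, None), ('b', 1, 1), ('c', 1, None),
        ('d', None, 1), ('e', 1, 1)]

# Equivalent of: (col("foo") == 1) & ~col("bar").eqNullSafe(1)
kept = [r[0] for r in rows if r[1] == 1 and not eq_null_safe(r[2], 1)]
print(kept)  # ['a', 'c']
```

Because the negated comparison is False (not unknown) for NULL values of bar, those rows survive the filter.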

Source: https://habr.com/ru/post/1673515/
