There are several effective ways to implement this. Start with the required import:
from pyspark.sql.functions import col, expr, when
You can use the Hive IF function inside expr:
new_column_1 = expr(
    """IF(fruit1 IS NULL OR fruit2 IS NULL, 3,
          IF(fruit1 = fruit2, 1, 0))"""
)
or when + otherwise:
new_column_2 = (
    when(col("fruit1").isNull() | col("fruit2").isNull(), 3)
    .when(col("fruit1") == col("fruit2"), 1)
    .otherwise(0)
)
Finally, you can use the following trick:
from pyspark.sql.functions import coalesce, lit

new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))
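The trick works because SQL uses three-valued logic: comparing anything to NULL yields NULL, and COALESCE returns its first non-NULL argument. A plain-Python sketch of those semantics (no Spark required; sql_eq and the rows list are illustrative names, not Spark API):

```python
def sql_eq(a, b):
    # NULL (None) on either side propagates, as in SQL three-valued logic
    if a is None or b is None:
        return None
    return int(a == b)

def coalesce(*args):
    # first non-None argument, like SQL COALESCE
    return next((x for x in args if x is not None), None)

rows = [("orange", "apple"), ("kiwi", None), (None, "banana"),
        ("mango", "mango"), (None, None)]
print([coalesce(sql_eq(a, b), 3) for a, b in rows])
# → [0, 3, 3, 1, 3]
```

So the cast("int") comparison produces 0, 1, or NULL, and coalesce(..., lit(3)) turns exactly the NULL cases into 3.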
With sample data:
df = sc.parallelize([
    ("orange", "apple"), ("kiwi", None),
    (None, "banana"), ("mango", "mango"),
    (None, None)
]).toDF(["fruit1", "fruit2"])
you can use it as follows:
(df
    .withColumn("new_column_1", new_column_1)
    .withColumn("new_column_2", new_column_2)
    .withColumn("new_column_3", new_column_3)
    .show())
and the result:
+------+------+------------+------------+------------+
|fruit1|fruit2|new_column_1|new_column_2|new_column_3|
+------+------+------------+------------+------------+
|orange| apple|           0|           0|           0|
|  kiwi|  null|           3|           3|           3|
|  null|banana|           3|           3|           3|
| mango| mango|           1|           1|           1|
|  null|  null|           3|           3|           3|
+------+------+------------+------------+------------+