PySpark: withColumn() with two conditions and three results

I work with Spark and PySpark. I am trying to achieve a result equivalent to the following pseudocode:

 df = df.withColumn('new_column', IF fruit1 == fruit2 THEN 1 ELSE 0; IF fruit1 IS NULL OR fruit2 IS NULL THEN 3)

I am trying to do this in PySpark, but I am not sure of the syntax. Any pointers? I looked in expr() but could not get it to work.

Note that df is a pyspark.sql.dataframe.DataFrame.

3 answers

There are several effective ways to implement this. Start with the required import:

 from pyspark.sql.functions import col, expr, when 

You can use the Hive IF function inside expr:

 new_column_1 = expr(
     """IF(fruit1 IS NULL OR fruit2 IS NULL, 3,
           IF(fruit1 = fruit2, 1, 0))"""
 )

or when + otherwise:

 new_column_2 = (
     when(col("fruit1").isNull() | col("fruit2").isNull(), 3)
     .when(col("fruit1") == col("fruit2"), 1)
     .otherwise(0)
 )

Finally, you can use the following trick:

 from pyspark.sql.functions import coalesce, lit

 new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))
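The trick relies on SQL three-valued logic: if either side of fruit1 = fruit2 is NULL, the comparison itself evaluates to NULL, the cast to int preserves the NULL, and coalesce then substitutes 3. A plain-Python sketch of that evaluation order (the helper names eq_then_cast and coalesce_py are illustrative, not PySpark API):

```python
def eq_then_cast(a, b):
    # SQL equality: NULL if either operand is NULL, else boolean cast to int
    if a is None or b is None:
        return None
    return int(a == b)

def coalesce_py(value, default):
    # coalesce returns the first non-NULL argument
    return value if value is not None else default

rows = [("orange", "apple"), ("kiwi", None), (None, "banana"), ("mango", "mango")]
result = [coalesce_py(eq_then_cast(f1, f2), 3) for f1, f2 in rows]
print(result)  # -> [0, 3, 3, 1]
```

This is why a single coalesce expression covers all three outcomes without an explicit null check.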

With sample data:

 df = sc.parallelize([
     ("orange", "apple"), ("kiwi", None),
     (None, "banana"), ("mango", "mango"), (None, None)
 ]).toDF(["fruit1", "fruit2"])

you can use it as follows:

 (df
     .withColumn("new_column_1", new_column_1)
     .withColumn("new_column_2", new_column_2)
     .withColumn("new_column_3", new_column_3))

and the result:

 +------+------+------------+------------+------------+
 |fruit1|fruit2|new_column_1|new_column_2|new_column_3|
 +------+------+------------+------------+------------+
 |orange| apple|           0|           0|           0|
 |  kiwi|  null|           3|           3|           3|
 |  null|banana|           3|           3|           3|
 | mango| mango|           1|           1|           1|
 |  null|  null|           3|           3|           3|
 +------+------+------------+------------+------------+

You can use a udf, as below:

 from pyspark.sql.types import IntegerType
 from pyspark.sql.functions import udf

 def func(fruit1, fruit2):
     # SQL NULLs arrive as Python None inside a udf
     if fruit1 is None or fruit2 is None:
         return 3
     if fruit1 == fruit2:
         return 1
     return 0

 func_udf = udf(func, IntegerType())
 df = df.withColumn('new_column', func_udf(df['fruit1'], df['fruit2']))
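Because PySpark passes SQL NULLs into a udf as Python None, the wrapped function can be sanity-checked without a Spark session. A minimal sketch of that check, restating the same function:

```python
def func(fruit1, fruit2):
    # NULLs arrive as Python None, so a plain None check covers the NULL case
    if fruit1 is None or fruit2 is None:
        return 3
    if fruit1 == fruit2:
        return 1
    return 0

pairs = [("mango", "mango"), ("kiwi", None), ("orange", "apple"), (None, None)]
print([func(f1, f2) for f1, f2 in pairs])  # -> [1, 3, 0, 3]
```

Testing the plain function first makes it easier to debug the logic before wrapping it with udf, where errors surface only at execution time inside Spark tasks.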

The withColumn function in PySpark lets you create a new column from conditions: chain when and otherwise calls and you get a working if/else structure. For this you need to import the Spark SQL functions, since the snippet below will not work without col(). First we declare the new column, "new_column", and set the condition inside when (i.e. fruit1 == fruit2), returning 1 when it is true. When it is false, otherwise hands off to a second when, which checks whether fruit1 or fruit2 is null using isNull(), returning 3 if so and 0 otherwise.

 from pyspark.sql import functions as F

 df = df.withColumn(
     'new_column',
     F.when(F.col('fruit1') == F.col('fruit2'), 1)
      .otherwise(
          F.when((F.col('fruit1').isNull()) | (F.col('fruit2').isNull()), 3)
           .otherwise(0)
      )
 )

Source: https://habr.com/ru/post/1258561/

