PySpark: replace null values in a column with values from another column

I want to replace the null values in one column with the values from the adjacent column. For example, if I have

A | B
0 | 1
2 | null
3 | null
4 | 2

I want this to be:

A | B
0 | 1
2 | 2
3 | 3
4 | 2

I tried

df.na.fill(df.A,"B")

But it didn't work; it says the value should be a float, int, long, string, or dict.
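
For reference, here is a minimal way to reproduce my setup (assuming a SparkSession named spark); na.fill only accepts literal values or a dict, not a Column, which is why passing df.A is rejected:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0, 1), (2, None), (3, None), (4, 2)], ["A", "B"])
# df.na.fill(df.A, "B")  # rejected: fill expects a literal or dict, not a Column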

Any ideas?

3 answers

In the end, I found an alternative:

from pyspark.sql.functions import coalesce

df = df.withColumn("B", coalesce(df.B, df.A))
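
For a quick end-to-end check, here is a minimal sketch, assuming a SparkSession named spark and the example data from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0, 1), (2, None), (3, None), (4, 2)], ["A", "B"])

# coalesce returns the first non-null value per row: B where it exists, otherwise A
df.withColumn("B", coalesce(df.B, df.A)).show()

This leaves B as 1, 2, 3, 2, matching the desired output.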

Another answer.

If df1 below is your dataframe:

rd1 = sc.parallelize([(0,1), (2,None), (3,None), (4,2)])
df1 = rd1.toDF(['A', 'B'])

from pyspark.sql.functions import when

# Keep B where it is not null; otherwise take the value from A
df1.select('A',
           when(df1.B.isNull(), df1.A).otherwise(df1.B).alias('B')
          )\
   .show()
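
Note that select returns a new DataFrame rather than modifying df1 in place; the when(...).otherwise(...) expression keeps B where it is not null and falls back to A otherwise, producing the same 1, 2, 3, 2 column as the coalesce answer.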
A third option, using the RDD API:

from pyspark.sql import Row

# Keep the row if B (row[1]) is not null, otherwise copy A into B
df.rdd.map(lambda row: row if row[1] is not None else Row(A=row[0], B=row[0])).toDF().show()
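
This round-trips the data through the RDD API, which generally carries more overhead than the column-expression approaches above (coalesce or when/otherwise), so it is best reserved for cases where you are already working with RDDs.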

Source: https://habr.com/ru/post/1673071/

