How to replace all null values in a DataFrame in PySpark

I have a DataFrame in PySpark with over 300 columns. Many of these columns contain null values.

For instance:

 column_1  column_2
 null      null
 null      null
 234       null
 125       124
 365       187
 ...

When I sum column_1, I get null as the result instead of 724.

Now I want to replace the null values in all columns of the DataFrame with zero, so that when I sum these columns I get a numeric value instead of null.

How can we achieve this in PySpark?

2 answers

You can use df.na.fill to replace nulls with zeros, for example:

 >>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
 >>> df.show()
 +----+
 | col|
 +----+
 |   1|
 |   2|
 |   3|
 |null|
 +----+
 >>> df.na.fill(0).show()
 +---+
 |col|
 +---+
 |  1|
 |  2|
 |  3|
 |  0|
 +---+

You can use the fillna() function.

 >>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
 >>> df.show()
 +----+
 | col|
 +----+
 |   1|
 |   2|
 |   3|
 |null|
 +----+
 >>> df.fillna({'col': 4}).show()
 +---+
 |col|
 +---+
 |  1|
 |  2|
 |  3|
 |  4|
 +---+

Source: https://habr.com/ru/post/1264374/
