Reading and writing an empty string "" versus NULL in Spark 2.0.1

CSVFileFormat seems to read and write empty values as null for string columns. I searched around but could not find clear information about this, so I put together a simple test.

 val df = session.createDataFrame(Seq(
   (0, "a"),
   (1, "b"),
   (2, "c"),
   (3, ""),
   (4, null)
 ))
 df.coalesce(1).write.mode("overwrite").format("csv")
   .option("delimiter", ",")
   .option("nullValue", "unknown")
   .option("treatEmptyValuesAsNulls", "false")
   .save(s"$path/test")

The output is:

 0,a
 1,b
 2,c
 3,unknown
 4,unknown

So it seems to treat both empty strings and null values as null. The same thing happens when reading a CSV file containing empty quoted strings and nulls. Is there any way to treat them differently?
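The conflation described above can be sketched in plain Scala. This is a hypothetical model of the pre-2.4 writer behavior, not Spark's actual code (the function name `writeCellPre24` is made up for illustration): both null and "" fall into the nullValue branch.

```scala
// Hypothetical sketch of a pre-2.4-style CSV writer rendering a string cell:
// empty strings are conflated with null, so both become the nullValue token.
def writeCellPre24(value: String, nullValue: String): String =
  if (value == null || value.isEmpty) nullValue // "" and null are indistinguishable here
  else value
```

Under this model, rows 3 and 4 from the test both come out as `unknown`, matching the output shown.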

1 answer

Only two and a half years later: as of Spark 2.4.0, empty strings are no longer treated as null! Check out this commit to learn a little about the functionality. Your code behaves as expected on 2.4.0+:

 val df = session.createDataFrame(Seq(
   (0, "a"),
   (1, "b"),
   (2, "c"),
   (3, ""),
   (4, null)
 ))
 df.coalesce(1).write.mode("overwrite").format("csv")
   .option("delimiter", ",")
   .option("nullValue", "unknown")
   .option("treatEmptyValuesAsNulls", "false")
   .save(s"$path/test")

Results in:

 0,a
 1,b
 2,c
 3,
 4,unknown
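The 2.4.0+ behavior can be modeled the same way. Again, this is a hypothetical sketch (the name `writeCell24` is invented for illustration, not Spark's internals): only genuine nulls get the nullValue token, while empty strings survive as empty fields.

```scala
// Hypothetical sketch of the Spark 2.4+ behavior: null and "" now diverge.
def writeCell24(value: String, nullValue: String): String =
  if (value == null) nullValue // only real nulls become the null token
  else value                   // "" stays an empty field, as in the "3," row
```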

Source: https://habr.com/ru/post/1013047/
