Reading and writing an empty string "" versus NULL in Spark 2.0.1

CSVFileFormat seems to read and write empty values as null for string columns. I searched around but could not find clear information about this, so I put together a simple test.

 val df = session.createDataFrame(Seq(
   (0, "a"),
   (1, "b"),
   (2, "c"),
   (3, ""),
   (4, null)
 ))
 df.coalesce(1).write.mode("overwrite").format("csv")
   .option("delimiter", ",")
   .option("nullValue", "unknown")
   .option("treatEmptyValuesAsNulls", "false")
   .save(s"$path/test")

The output is:

 0,a
 1,b
 2,c
 3,unknown
 4,unknown

So it seems to treat both empty strings and null values as null. The same thing happens when reading a CSV file containing empty quoted strings and nulls. Is there any way to treat them differently?
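The conflation described above can be sketched in plain Scala. This is a hypothetical model of the pre-2.4 writer behavior, not Spark's actual code (the function name `writeCellPre24` is made up for illustration): both null and "" fall into the nullValue branch.

```scala
// Hypothetical sketch of a pre-2.4-style CSV writer rendering a string cell:
// empty strings are conflated with null, so both become the nullValue token.
def writeCellPre24(value: String, nullValue: String): String =
  if (value == null || value.isEmpty) nullValue // "" and null are indistinguishable here
  else value
```

Under this model, rows 3 and 4 from the test both come out as `unknown`, matching the output shown.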

1 answer

Only two and a half years later: as of Spark 2.4.0, empty strings are no longer treated as null! Check out this commit to learn a little about the functionality. Your code behaves as expected on 2.4.0+:

 val df = session.createDataFrame(Seq(
   (0, "a"),
   (1, "b"),
   (2, "c"),
   (3, ""),
   (4, null)
 ))
 df.coalesce(1).write.mode("overwrite").format("csv")
   .option("delimiter", ",")
   .option("nullValue", "unknown")
   .option("treatEmptyValuesAsNulls", "false")
   .save(s"$path/test")

Results in:

 0,a
 1,b
 2,c
 3,
 4,unknown
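The 2.4.0+ behavior can be modeled the same way. Again, this is a hypothetical sketch (the name `writeCell24` is invented for illustration, not Spark's internals): only genuine nulls get the nullValue token, while empty strings survive as empty fields.

```scala
// Hypothetical sketch of the Spark 2.4+ behavior: null and "" now diverge.
def writeCell24(value: String, nullValue: String): String =
  if (value == null) nullValue // only real nulls become the null token
  else value                   // "" stays an empty field, as in the "3," row
```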

Source: https://habr.com/ru/post/1013047/
