I have several hundred folders in S3, each containing thousands of gzipped text files, and I'm trying to read them into a dataframe with spark.read.csv().
Some of the files are zero-length, which leads to an error:
java.io.EOFException: Unexpected end of input stream
code:
df = spark.read.csv('s3n://my-bucket/folder*/logfiles*.log.gz', sep='\t', schema=schema)
I tried setting the mode to DROPMALFORMED and reading the files with sc.textFile() instead, but no luck.
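For reference, the DROPMALFORMED attempt was roughly the following (reconstructed; the exact option placement is an assumption):

df = (spark.read
      .option('mode', 'DROPMALFORMED')   # drops rows the CSV parser can't handle
      .csv('s3n://my-bucket/folder*/logfiles*.log.gz', sep='\t', schema=schema))

As far as I can tell, mode only applies to CSV parsing, so it may never see the error, since the EOFException is thrown while decompressing the gzip stream.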
What is the best way to handle empty or corrupt gzip files?
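One workaround I'm considering is to list the objects first and pass only non-empty keys to the reader, since spark.read.csv() accepts a list of paths. A minimal sketch, assuming boto3 is available with credentials configured (the bucket name and suffix mirror the glob above):

import boto3

s3 = boto3.client('s3')
paths = []
# Paginate through the bucket and keep only non-empty gzip files.
for page in s3.get_paginator('list_objects_v2').paginate(Bucket='my-bucket'):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('.log.gz') and obj['Size'] > 0:
            paths.append('s3n://my-bucket/' + obj['Key'])

df = spark.read.csv(paths, sep='\t', schema=schema)

This only filters out zero-length files, though; a truncated but non-empty archive would presumably still raise the same EOFException.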