I have several hundred folders in S3, each containing thousands of gzipped text files, and I'm trying to read them into a dataframe with spark.read.csv().
Some of the files are zero-length, which leads to an error:
java.io.EOFException: Unexpected end of input stream
code:
df = spark.read.csv('s3n://my-bucket/folder*/logfiles*.log.gz', sep='\t', schema=schema)
I tried setting the mode to DROPMALFORMED and reading the files with sc.textFile() instead, but no luck.
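For reference, the DROPMALFORMED attempt was roughly the following (reconstructed; the exact option placement is an assumption):

df = (spark.read
      .option('mode', 'DROPMALFORMED')   # drops rows the CSV parser can't handle
      .csv('s3n://my-bucket/folder*/logfiles*.log.gz', sep='\t', schema=schema))

As far as I can tell, mode only applies to CSV parsing, so it may never see the error, since the EOFException is thrown while decompressing the gzip stream.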
What is the best way to handle empty or corrupt gzip files?
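One workaround I'm considering is to list the objects first and pass only non-empty keys to the reader, since spark.read.csv() accepts a list of paths. A minimal sketch, assuming boto3 is available with credentials configured (the bucket name and suffix mirror the glob above):

import boto3

s3 = boto3.client('s3')
paths = []
# Paginate through the bucket and keep only non-empty gzip files.
for page in s3.get_paginator('list_objects_v2').paginate(Bucket='my-bucket'):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('.log.gz') and obj['Size'] > 0:
            paths.append('s3n://my-bucket/' + obj['Key'])

df = spark.read.csv(paths, sep='\t', schema=schema)

This only filters out zero-length files, though; a truncated but non-empty archive would presumably still raise the same EOFException.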