Spark - how to skip or ignore empty gzip files when reading

I have several hundred folders in S3, each containing thousands of gzipped text files, and I'm trying to read them into a DataFrame with spark.read.csv().

Among the files there are some with zero length, which leads to an error:

java.io.EOFException: Unexpected end of input stream

code:

df = spark.read.csv('s3n://my-bucket/folder*/logfiles*.log.gz', sep='\t', schema=schema)

I tried setting mode to DROPMALFORMED and also reading the files with sc.textFile() instead, but no luck.

What is the best way to handle empty or corrupt gzip files?

1 answer

Since Spark 2.1, you can ignore corrupt files (including zero-length gzip files) by enabling the spark.sql.files.ignoreCorruptFiles setting. Pass it as a config option to spark-submit or pyspark:

--conf spark.sql.files.ignoreCorruptFiles=true
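
The same setting can also be applied in code. Here is a minimal PySpark sketch, assuming the schema variable and bucket path from the question (the app name is a placeholder):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-gzipped-logs")  # hypothetical app name
    # Equivalent to passing --conf on the command line:
    .config("spark.sql.files.ignoreCorruptFiles", "true")
    .getOrCreate()
)

# It is a runtime SQL config, so it can also be toggled on an existing session:
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Empty or corrupt gzip files are now skipped instead of failing the job.
df = spark.read.csv(
    's3n://my-bucket/folder*/logfiles*.log.gz',
    sep='\t',
    schema=schema,  # schema defined earlier in the question
)

Note that skipped files are dropped silently, so row counts will exclude whatever those files would have contributed.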


Source: https://habr.com/ru/post/1016325/

