The file names do not end with .gz, and I cannot rename them because they are shared with other programs.
file1.log.gz.processed is simply a gzipped CSV file. How can I read it in PySpark, preferably via pyspark.sql?
I tried specifying the format and compression, but could not find the correct key/value pair. For example,
sqlContext.load(fn, format='gz')
does not work. Although Spark can handle gzipped files, it seems to infer the codec from the file name extension. For example,
sc.textFile(fn)
will work if the file ends with .gz, but not in my case.
How can I instruct Spark to use the correct codec? Thanks!
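One workaround I have been considering (sketch only, not tested against my setup) is to read the files as raw bytes with `sc.binaryFiles`, which ignores the extension entirely, and gunzip them by hand. The helper below assumes UTF-8 text; the `sc` and the file path in the commented usage are placeholders:

```python
import gzip

def gunzip_lines(raw_bytes):
    """Decompress gzip-encoded bytes and split the result into
    text lines, regardless of what the file name looks like."""
    return gzip.decompress(raw_bytes).decode("utf-8").splitlines()

# Hypothetical Spark usage (assumes an existing SparkContext `sc`):
# binaryFiles yields (path, bytes) pairs, so the missing .gz
# suffix does not matter.
#
# rdd = (sc.binaryFiles("file1.log.gz.processed")
#          .flatMap(lambda kv: gunzip_lines(kv[1])))
# df = rdd.map(lambda line: line.split(",")).toDF()
```

This sidesteps codec inference, though it loads each whole file on one executor, so it only suits files that fit in memory.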