How to load a gzip-compressed CSV file in PySpark?

The file names do not end in .gz, and I cannot change them because they are shared by other programs.

file1.log.gz.processed is just a CSV file. But how can I read it in PySpark, preferably with pyspark.sql?

I tried specifying the format and compression, but could not find the correct key/value pair. For example,

sqlContext.load(fn, format='gz')

does not work. Although Spark can handle gz files, it seems to determine the compression codec from the file name. For example,

sc.textFile(fn)

will work if the file ends with .gz, but not in my case.
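To illustrate the behavior (the paths here are hypothetical):

sc.textFile("/data/file1.log.gz")            # works: the .gz suffix selects the gzip codec
sc.textFile("/data/file1.log.gz.processed")  # no codec applied: yields raw gzip bytes as garbage lines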

How can I instruct Spark to use the correct codec? Thanks!

1 answer

You should not use .load this way, as it has been deprecated (since version 1.4.0); use read.format(source).schema(schema).options(options).load() instead.

df = (sql_context.read
      .format("com.databricks.spark.csv")
      .options(header="true",        # or "false" if there is no header row
               inferSchema="true")
      .load(file_path + ".gz"))
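If copying or renaming the file so that it ends in .gz is not an option, one alternative (not from the original answer, just a sketch) is to bypass the extension-based codec detection entirely: read the whole file as bytes with sc.binaryFiles and decompress it manually. A minimal sketch, assuming a UTF-8 CSV; the path, the naive comma split, and the final DataFrame step are assumptions about the data:

import gzip

# binaryFiles yields (path, bytes) pairs and ignores the extension,
# so the misleading .processed suffix does not matter here.
raw = sc.binaryFiles("/data/file1.log.gz.processed")  # hypothetical path

# Decompress each file's bytes and split the result into text lines.
lines = raw.flatMap(
    lambda kv: gzip.decompress(kv[1]).decode("utf-8").splitlines())

# Naive CSV parsing: split on commas (no quoting/escaping support).
rows = lines.map(lambda line: line.split(","))

df = sqlContext.createDataFrame(rows)  # columns default to _1, _2, ...

Note that this loads each file into memory on a single executor, so it suits modest file sizes. On Spark 2.x and later the built-in spark.read.csv(...) replaces the com.databricks.spark.csv package, but it still picks the compression codec from the file extension.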

Source: https://habr.com/ru/post/1618868/

