How to load a gzip-compressed CSV file in PySpark?

The file names do not end in .gz, and I cannot change them because they are shared by other programs.

file1.log.gz.processed is just a CSV file. But how can I read it in PySpark, preferably with pyspark.sql?

I tried specifying the format and compression, but could not find the correct key/value pair. For example,

sqlContext.load(fn, format='gz')

does not work. Although Spark can handle gz files, it seems to determine the compression codec from the file name. For example,

sc.textFile(fn)

will work if the file ends with .gz, but not in my case.
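To illustrate the behavior (the paths here are hypothetical):

sc.textFile("/data/file1.log.gz")            # works: the .gz suffix selects the gzip codec
sc.textFile("/data/file1.log.gz.processed")  # no codec applied: yields raw gzip bytes as garbage lines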

How can I instruct Spark to use the correct codec? Thanks!

1 answer

You should not use .load this way, as it has been deprecated (since version 1.4.0); use read.format(source).schema(schema).options(options).load() instead.

df = (sql_context.read
      .format("com.databricks.spark.csv")
      .options(header="true",        # or "false" if there is no header row
               inferSchema="true")
      .load(file_path + ".gz"))
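If copying or renaming the file so that it ends in .gz is not an option, one alternative (not from the original answer, just a sketch) is to bypass the extension-based codec detection entirely: read the whole file as bytes with sc.binaryFiles and decompress it manually. A minimal sketch, assuming a UTF-8 CSV; the path, the naive comma split, and the final DataFrame step are assumptions about the data:

import gzip

# binaryFiles yields (path, bytes) pairs and ignores the extension,
# so the misleading .processed suffix does not matter here.
raw = sc.binaryFiles("/data/file1.log.gz.processed")  # hypothetical path

# Decompress each file's bytes and split the result into text lines.
lines = raw.flatMap(
    lambda kv: gzip.decompress(kv[1]).decode("utf-8").splitlines())

# Naive CSV parsing: split on commas (no quoting/escaping support).
rows = lines.map(lambda line: line.split(","))

df = sqlContext.createDataFrame(rows)  # columns default to _1, _2, ...

Note that this loads each file into memory on a single executor, so it suits modest file sizes. On Spark 2.x and later the built-in spark.read.csv(...) replaces the com.databricks.spark.csv package, but it still picks the compression codec from the file extension.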

Source: https://habr.com/ru/post/1618868/

