How to read gz compressed file using pyspark

I have string data in a compressed .gz file, and I need to read it in PySpark. Below is a snippet of my code:

rdd = sc.textFile("data/label.gz").map(func)

But I could not read this file successfully. How do I read a compressed .gz file? I found a similar question here, but my current Spark version is different from the version in that question. I expect there is some kind of built-in function for this, as in Hadoop.

2 answers

The Spark documentation clearly states that gz files can be read automatically:

All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

Just run:

rdd = sc.textFile("data/label.gz")

print(rdd.take(10))

Provided the file data/label.gz exists, this will print its first 10 lines.

Note that data/label.gz must be a path on HDFS if you are running Spark on a cluster. Did you check that?
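If in doubt about where Spark is looking, a fully qualified URI removes the ambiguity. The sketch below (file names and paths are placeholders, not from the original post) builds a small sample label.gz locally and wraps the Spark read in a helper; the Spark call itself of course requires a Spark installation:

```python
import gzip

def write_sample_gz(path, lines):
    # Create a tiny gz file locally to stand in for data/label.gz
    # (the path and contents here are placeholders).
    with gzip.open(path, "wt") as f:
        f.write("\n".join(lines) + "\n")

def read_first_lines(sc, path, n=10):
    # sc is an existing SparkContext. textFile decompresses .gz
    # transparently, whether the path is local or on HDFS, e.g.
    #   file:///tmp/label.gz   or   hdfs:///user/me/data/label.gz
    return sc.textFile(path).take(n)
```

In cluster mode a bare relative path like data/label.gz resolves against HDFS, so a file that exists locally can still appear "missing" to Spark.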


One thing to keep in mind is that gzipped files are not splittable, so each file is read as a whole by a single task. If you need a splittable compressed format, use bzip2 instead.
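Because a .gz file is one compressed stream, Spark loads it into a single partition. A common workaround (a sketch, not part of the original answer; names are illustrative) is to repartition right after reading so later stages get parallelism:

```python
import gzip

def make_gz(path, n_lines):
    # Build a sample gz file: a single compressed stream, which any
    # reader (including Spark) must decompress sequentially from the start.
    with gzip.open(path, "wt") as f:
        for i in range(n_lines):
            f.write("line %d\n" % i)

def load_and_spread(sc, path, parts=8):
    # The gz file arrives as one partition; repartition redistributes
    # the decompressed lines so downstream stages run in parallel.
    return sc.textFile(path).repartition(parts)
```

Repartitioning costs a shuffle, so it only pays off when the work after the read is heavy enough to benefit from the extra parallelism.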


Source: https://habr.com/ru/post/1672150/

