How to read gz compressed file using pyspark

I have string data in a compressed .gz file, and I need to read it in PySpark. Below is a snippet of my code:

rdd = sc.textFile("data/label.gz").map(func)

But I could not read this file successfully. How do I read a compressed .gz file? I found a similar question here, but my current Spark version is different from the version in that question. I expect there is some kind of built-in function for this, as in Hadoop.

2 answers

The Spark documentation clearly states that gz files can be read automatically:

All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

Just run:

rdd = sc.textFile("data/label.gz")

print(rdd.take(10))

Provided the file data/label.gz exists, this will print its first 10 lines.

Note that data/label.gz must be a path on HDFS if you are running Spark on a cluster. Did you check that?
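If in doubt about where Spark is looking, a fully qualified URI removes the ambiguity. The sketch below (file names and paths are placeholders, not from the original post) builds a small sample label.gz locally and wraps the Spark read in a helper; the Spark call itself of course requires a Spark installation:

```python
import gzip

def write_sample_gz(path, lines):
    # Create a tiny gz file locally to stand in for data/label.gz
    # (the path and contents here are placeholders).
    with gzip.open(path, "wt") as f:
        f.write("\n".join(lines) + "\n")

def read_first_lines(sc, path, n=10):
    # sc is an existing SparkContext. textFile decompresses .gz
    # transparently, whether the path is local or on HDFS, e.g.
    #   file:///tmp/label.gz   or   hdfs:///user/me/data/label.gz
    return sc.textFile(path).take(n)
```

In cluster mode a bare relative path like data/label.gz resolves against HDFS, so a file that exists locally can still appear "missing" to Spark.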


One thing to keep in mind is that gzipped files are not splittable, so each file is read as a whole by a single task. If you need a splittable compressed format, use bzip2 instead.
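Because a .gz file is one compressed stream, Spark loads it into a single partition. A common workaround (a sketch, not part of the original answer; names are illustrative) is to repartition right after reading so later stages get parallelism:

```python
import gzip

def make_gz(path, n_lines):
    # Build a sample gz file: a single compressed stream, which any
    # reader (including Spark) must decompress sequentially from the start.
    with gzip.open(path, "wt") as f:
        for i in range(n_lines):
            f.write("line %d\n" % i)

def load_and_spread(sc, path, parts=8):
    # The gz file arrives as one partition; repartition redistributes
    # the decompressed lines so downstream stages run in parallel.
    return sc.textFile(path).repartition(parts)
```

Repartitioning costs a shuffle, so it only pays off when the work after the read is heavy enough to benefit from the extra parallelism.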


Source: https://habr.com/ru/post/1672150/

