Why are my `binaryFiles` empty when I collect them in pyspark?

I have two zip files in HDFS, in the same folder: /user/path-to-folder-with-zips/.

I pass this path to `binaryFiles` in pyspark:

zips = sc.binaryFiles('/user/path-to-folder-with-zips/')

I am trying to unzip the zip files and do something with the text files inside them, so I first wanted to see what the contents of the RDD look like. I did it like this:

zips_collected = zips.collect()

But, when I do this, it gives an empty list:

>>> zips_collected
[]

I know that the zip files are not empty - they contain text files. The documentation here says

Each file is read as one record and returned in a key-value pair, where the key is the path to each file, the value is the contents of each file.

So what is going on here? Why can't I see the contents of the files, and how can I fix this?

Each text file inside the zip is formatted like this:

rownum|data|data|data|data|data
rownum|data|data|data|data|data
rownum|data|data|data|data|data
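Since each row is pipe-delimited with a leading row number, a minimal sketch of a line parser might look like this (`parse_line` is a hypothetical helper; the answer below references it but does not define it):

```python
def parse_line(line):
    """Split a pipe-delimited row into (rownum, fields).

    Hypothetical parser for rows shaped like 'rownum|data|data|...'.
    """
    fields = line.split("|")
    return fields[0], fields[1:]

# Example: parse_line("1|a|b|c") yields ("1", ["a", "b", "c"])
```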

You cannot read the compressed contents of the zip files directly (they are binary archives). You have to extract the archive in memory using io.BytesIO. See fooobar.com/questions/557401/....

import io
import gzip

def zip_extract(x):
    """Extract a *.gz file in memory for Spark.

    x is a (path, bytes) pair as returned by sc.binaryFiles.
    """
    file_obj = gzip.GzipFile(fileobj=io.BytesIO(x[1]), mode="r")
    # Decode so the result can be split into text lines under Python 3.
    return file_obj.read().decode("utf-8")

zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
results = zip_data.map(zip_extract) \
                  .flatMap(lambda zip_file: zip_file.split("\n")) \
                  .map(lambda line: parse_line(line)) \
                  .collect()  # parse_line is a user-defined row parser
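Note that gzip.GzipFile only handles .gz streams. If the files are true ZIP archives (which can hold several members), a sketch using the standard-library zipfile module on the in-memory bytes could look like this (the function name and fake path are illustrative):

```python
import io
import zipfile

def zip_extract_all(x):
    """Extract every member of a ZIP archive held in memory.

    x is a (path, bytes) pair such as sc.binaryFiles produces.
    Returns the decoded text of all members joined by newlines.
    """
    with zipfile.ZipFile(io.BytesIO(x[1])) as zf:
        return "\n".join(
            zf.read(name).decode("utf-8") for name in zf.namelist()
        )
```

This drops straight into the pipeline above in place of zip_extract, since it takes the same (path, bytes) pair and returns a text blob ready for splitting into lines.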

Source: https://habr.com/ru/post/1672154/

