I have two zip files in HDFS in the same folder: /user/path-to-folder-with-zips/.
I pass this path to "binaryFiles" in PySpark:
zips = sc.binaryFiles('/user/path-to-folder-with-zips/')
I am trying to unzip the zip files and process the text files inside them, so as a first step I just wanted to see what the content looks like when I collect the RDD. I did it like this:
zips_collected = zips.collect()
But, when I do this, it gives an empty list:
>> zips_collected
[]
I know that the zips are not empty - they contain text files. The documentation here says
Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
What am I doing wrong here? Why does collect() return an empty list?
The text files inside each zip look like this:
rownum|data|data|data|data|data
rownum|data|data|data|data|data
rownum|data|data|data|data|data
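
For context, this is roughly what I intend to apply to each record once the RDD is no longer empty. It is only a minimal sketch: the function name extract_rows is hypothetical, and I am assuming each value from binaryFiles arrives as raw bytes, that the members of each zip are UTF-8 text, and that fields are separated by "|" as in the sample above.

```python
import io
import zipfile


def extract_rows(record):
    """Turn one (path, zip_bytes) pair into (path, member_name, fields) tuples."""
    path, content = record
    rows = []
    # Wrap the raw bytes in a seekable buffer so zipfile can read the archive.
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            # Assumption: every member is UTF-8 encoded text.
            text = archive.read(member).decode("utf-8")
            for line in text.splitlines():
                if line:  # ignore blank lines
                    rows.append((path, member, line.split("|")))
    return rows
```

The idea would be to call something like zips.flatMap(extract_rows) so that every line of every text file becomes its own record, but that only helps once binaryFiles actually returns the two zips.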