Download and unzip a .zip file without writing to disk

I managed to get my first Python script to work. It downloads a list of .ZIP files from a URL, then extracts the ZIP files and writes them to disk.

I am now at a loss to reach the next step.

My main goal is to download and extract the zip file and pass the contents (CSV data) on through a TCP stream. I would prefer not to write any of the zip files or extracted files to disk if I can avoid it.

Here is my current script that works, but unfortunately has to write files to disk.

    import urllib, urllister
    import zipfile
    import urllib2
    import os
    import time
    import pickle

    # check for extraction directories existence
    if not os.path.isdir('downloaded'):
        os.makedirs('downloaded')

    if not os.path.isdir('extracted'):
        os.makedirs('extracted')

    # open logfile for downloaded data and save to local variable
    if os.path.isfile('downloaded.pickle'):
        downloadedLog = pickle.load(open('downloaded.pickle'))
    else:
        downloadedLog = {'key':'value'}

    # remove entries older than 5 days (to maintain speed)

    # path of zip files
    zipFileURL = "http://www.thewebserver.com/that/contains/a/directory/of/zip/files"

    # retrieve list of URLs from the webservers
    usock = urllib.urlopen(zipFileURL)
    parser = urllister.URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()

    # only parse urls
    for url in parser.urls:
        if "PUBLIC_P5MIN" in url:

            # download the file
            downloadURL = zipFileURL + url
            outputFilename = "downloaded/" + url

            # check if file already exists on disk
            if url in downloadedLog or os.path.isfile(outputFilename):
                print "Skipping " + downloadURL
                continue

            print "Downloading ",downloadURL
            response = urllib2.urlopen(downloadURL)
            zippedData = response.read()

            # save data to disk
            print "Saving to ",outputFilename
            output = open(outputFilename,'wb')
            output.write(zippedData)
            output.close()

            # extract the data
            zfobj = zipfile.ZipFile(outputFilename)
            for name in zfobj.namelist():
                uncompressed = zfobj.read(name)

                # save uncompressed data to disk
                outputFilename = "extracted/" + name
                print "Saving extracted file to ",outputFilename
                output = open(outputFilename,'wb')
                output.write(uncompressed)
                output.close()

                # send data via tcp stream

                # file successfully downloaded and extracted store into local log and filesystem log
                downloadedLog[url] = time.time();
                pickle.dump(downloadedLog, open('downloaded.pickle', "wb" ))
+66
python unzip
Apr 19 '11 at 2:13 am
8 answers

My suggestion would be to use a StringIO object. It emulates a file, but lives in memory. So you can do something like this:

    # get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'
    import zipfile
    from StringIO import StringIO

    zipdata = StringIO()
    zipdata.write(get_zip_data())
    myzipfile = zipfile.ZipFile(zipdata)
    foofile = myzipfile.open('foo.txt')
    print foofile.read()

    # output: "hey, foo"

Or more simply (apologies to Vishal):

    myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
    for name in myzipfile.namelist():
        [ ... ]

In Python 3, use BytesIO instead of StringIO.
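
A minimal Python 3 sketch of the same idea, for illustration only. There is no real download here: the hypothetical helper make_zip_bytes() stands in for get_zip_data() and builds a tiny archive in memory, and read_member() is likewise a made-up name.

```python
import io
import zipfile


def make_zip_bytes():
    """Hypothetical stand-in for get_zip_data(): build a tiny zip archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("foo.txt", "hey, foo")
    return buffer.getvalue()


def read_member(zip_bytes, name):
    """Open the archive from memory and return one member's contents as bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(name)


contents = read_member(make_zip_bytes(), "foo.txt")
print(contents.decode("utf-8"))
```

Nothing touches the disk in this sketch: both the archive and the extracted member only ever exist as bytes objects.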

+50
Apr 19 '11 at 2:23

Below is a code snippet that I used to fetch a zipped CSV file:

Python 2 :

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen

    resp = urlopen("http://www.test.com/file.zip")
    zipfile = ZipFile(StringIO(resp.read()))
    for line in zipfile.open(file).readlines():
        print line

Python 3 :

    from io import BytesIO
    from zipfile import ZipFile
    from urllib.request import urlopen
    # or: requests.get(url).content

    resp = urlopen("http://www.test.com/file.zip")
    zipfile = ZipFile(BytesIO(resp.read()))
    for line in zipfile.open(file).readlines():
        print(line.decode('utf-8'))

Here, file is a string holding the name of a member of the archive. To find the name you want to pass, you can use zipfile.namelist(). For example:

    resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
    zipfile = ZipFile(BytesIO(resp.read()))
    zipfile.namelist()
    # ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
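
Since the question is about CSV data, here is a rough sketch of how namelist() could be combined with the csv module. The function name iter_csv_rows, the sample file names, and the UTF-8 encoding are all assumptions made for illustration; the sample archive is built in memory so the snippet runs without a network connection.

```python
import csv
import io
import zipfile


def iter_csv_rows(zip_bytes, encoding="utf-8"):
    """Yield (member_name, row) for every row of every .csv member in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip members that are not CSV files
            with archive.open(name) as raw:
                # ZipFile.open() yields bytes; wrap it so csv.reader sees text.
                text = io.TextIOWrapper(raw, encoding=encoding, newline="")
                for row in csv.reader(text):
                    yield name, row


# Build a small sample archive in memory (made-up data, for illustration only).
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("prices.csv", "time,price\r\n00:05,41.2\r\n")
    archive.writestr("readme.txt", "not a csv")

for name, row in iter_csv_rows(sample.getvalue()):
    print(name, row)
```

In a real script, the bytes passed to iter_csv_rows would come from resp.read() instead of the sample archive.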
+63
Apr 19 2018-11-11T00:

I would like to offer an updated Python 3 version of Vishal's answer, which used Python 2, along with some explanation of the adaptations and changes, which may have already been mentioned.

    from io import BytesIO
    from zipfile import ZipFile
    import urllib.request

    url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")

    with ZipFile(BytesIO(url.read())) as my_zip_file:
        for contained_file in my_zip_file.namelist():
            # with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
            for line in my_zip_file.open(contained_file).readlines():
                print(line)
                # output.write(line)

Necessary changes:

  • There is no StringIO module in Python 3. Instead, I use io, and from it I import BytesIO, because we will be handling a bytestream (see the docs, and also this thread).
  • urlopen:
    • "The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen." (docs)
  • import urllib.request:
    • See this thread.

Note:

  • In Python 3, printed output lines will look like this: b'some text' . This is expected since they are not strings - remember, we are reading a byte stream. Check out Dan04's excellent answer .
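
To make the bytes-versus-text point concrete, here is a small sketch that reads lines from an in-memory archive and decodes them. The file name, its content, and the choice of UTF-8 are assumptions for illustration; a real file may use a different encoding.

```python
import io
import zipfile

# Build a one-file archive in memory (made-up content, for illustration only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("notes.txt", "first line\nsecond line\n")

with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    with archive.open("notes.txt") as member:
        raw_lines = member.readlines()

# Each element is a bytes object, which is why print() shows the b'...' prefix.
decoded_lines = [line.decode("utf-8").rstrip("\n") for line in raw_lines]

print(raw_lines[0])
print(decoded_lines[0])
```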

A few minor changes I made:

  • I use with ... as instead of zipfile = ... according to Docs .
  • The script now uses namelist() to cycle through all files in zip and print their contents.
  • I moved the creation of the ZipFile object into the with statement, although I'm not sure if this is better.
  • I added (and commented out) an option to write the bytestream to a file (one per file in the zip), in response to a comment by NumenorForLife; it adds "unzipped_and_read_" to the beginning of the file name and a ".file" extension (I prefer not to use ".txt" for files with bytestrings). Of course, the indentation of the code needs to be adjusted if you want to use it.
    • You need to be careful here: because we have a byte string, we use binary mode, so "wb"; I have a feeling that writing binary opens a can of worms anyway...
  • I am using an example file, the UN/LOCODE text archive.

What I did not do:

  • NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by this: downloading the zip file? That is a different task; see Oleg Pripin's excellent answer.

Here is one way:

    import urllib.request
    import shutil

    # open the output in binary mode ('wb'), because the response yields bytes
    with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
+18
Feb 08 '17 at 16:49

Write to a temporary file that resides in RAM.

It turns out the tempfile module ( http://docs.python.org/library/tempfile.html ) has just the thing:

tempfile.SpooledTemporaryFile([max_size=0[, mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None]]]]]])

This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file's fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().

The resulting file has one additional method, rollover(), which causes the file to roll over to an on-disk file regardless of its size.

The returned object is a file-like object whose _file attribute is either a StringIO object or a true file object, depending on whether rollover() has been called. This file-like object can be used in a with statement, just like a normal file.

New in version 2.6.

Or, if you are lazy and you have a tmpfs-mounted /tmp on Linux, you can just create a file there, but you have to delete it yourself and deal with naming.
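
A short Python 3 sketch of the spooling behaviour described above. The payload and the 1 MiB threshold are arbitrary choices for illustration, and peeking at the private _file attribute is done only to make the switch visible; in Python 3 the in-memory buffer for binary mode is an io.BytesIO rather than a StringIO.

```python
import io
import tempfile

payload = b"x" * 1024  # stand-in for downloaded zip bytes

# Keep up to 1 MiB in memory; anything larger would spill to a real temp file.
with tempfile.SpooledTemporaryFile(max_size=1024 * 1024) as spooled:
    spooled.write(payload)
    # _file is a private attribute; it is inspected here only to show the switch.
    in_memory_before = isinstance(spooled._file, io.BytesIO)

    spooled.rollover()  # force the switch to an on-disk file regardless of size
    in_memory_after = isinstance(spooled._file, io.BytesIO)

    spooled.seek(0)
    round_tripped = spooled.read()

print(in_memory_before, in_memory_after, round_tripped == payload)
```

In normal use you would not call rollover() yourself; small downloads then stay in memory and only unusually large ones end up on disk.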

+15
Apr 19 '11 at 2:16

I would like to add my Python3 answer for completeness:

    from io import BytesIO
    from zipfile import ZipFile
    import requests

    def get_zip(file_url):
        url = requests.get(file_url)
        zipfile = ZipFile(BytesIO(url.content))
        zip_names = zipfile.namelist()
        if len(zip_names) == 1:
            file_name = zip_names.pop()
            extracted_file = zipfile.open(file_name)
            return extracted_file
        return [zipfile.open(file_name) for file_name in zip_names]
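
File objects like the ones returned above can be copied straight into a socket, which is what the question ultimately asks for. The sketch below is one possible way to do that, not a tested production approach: send_zip_members is a made-up name, the archive is built in memory, and a local socket pair stands in for a real TCP connection.

```python
import io
import shutil
import socket
import zipfile


def send_zip_members(zip_bytes, sock):
    """Stream every member of an in-memory zip archive to a connected socket."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # makefile() gives a file-like writer so copyfileobj can copy in chunks.
        with sock.makefile("wb") as writer:
            for name in archive.namelist():
                with archive.open(name) as member:
                    shutil.copyfileobj(member, writer)


# Demo with a local socket pair instead of a real TCP connection.
archive_buffer = io.BytesIO()
with zipfile.ZipFile(archive_buffer, "w") as archive:
    archive.writestr("data.csv", "a,b\n1,2\n")

sender, receiver = socket.socketpair()
send_zip_members(archive_buffer.getvalue(), sender)
sender.close()  # closing signals end-of-stream to the receiver

received = b""
while True:
    chunk = receiver.recv(4096)
    if not chunk:
        break
    received += chunk
receiver.close()

print(received.decode("utf-8"))
```

For a real server, the socket would presumably come from socket.create_connection() with your own host and port, and the bytes from the HTTP response instead of the sample archive.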
+14
Jan 18 '16 at 19:58

Adding to the other answers, using requests:

    # download from web
    import requests
    url = 'http://mlg.ucd.ie/files/datasets/bbc.zip'
    content = requests.get(url)

    # unzip the content
    from io import BytesIO
    from zipfile import ZipFile
    f = ZipFile(BytesIO(content.content))
    print(f.namelist())

    # outputs ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']

Use help(f) to get more details on the available methods, for example extractall(), which extracts the contents of the zip file to disk so that they can later be opened with open.

+10
Mar 07 '18 at 11:00

In Vishal's answer, it was not clear what the file name was supposed to be when there is no file on disk. I have modified his answer so that it works without changes for most needs.

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen

    def unzip_string(zipped_string):
        unzipped_string = ''
        zipfile = ZipFile(StringIO(zipped_string))
        for name in zipfile.namelist():
            unzipped_string += zipfile.open(name).read()
        return unzipped_string
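
A possible Python 3 counterpart of the function above, as a sketch: it takes and returns bytes instead of str, and the name unzip_bytes and the sample archive are made up for illustration.

```python
import io
import zipfile


def unzip_bytes(zipped_bytes):
    """Return the concatenated contents of all archive members as bytes."""
    with zipfile.ZipFile(io.BytesIO(zipped_bytes)) as archive:
        return b"".join(archive.read(name) for name in archive.namelist())


# Build a two-file archive in memory to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "one\n")
    archive.writestr("b.txt", "two\n")

print(unzip_bytes(buffer.getvalue()))
```

Joining once at the end avoids the repeated concatenation in the loop, which matters only for archives with many members.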
+2
Jun 07 '15 at 5:00

Vishal's example, however great, is confusing when it comes to the file name, and I see no reason to redefine "zipfile".

Here is my example, which downloads a zip file containing several files, one of which is a CSV file that I then read into a pandas DataFrame:

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen
    import pandas

    url = urlopen("https://www.federalreserve.gov/apps/mdrm/pdf/MDRM.zip")
    zf = ZipFile(StringIO(url.read()))
    for item in zf.namelist():
        print("File in zip: "+ item)
    # find the first matching csv file in the zip:
    match = [s for s in zf.namelist() if ".csv" in s][0]
    # the first line of the file contains a string - that line shall be ignored, hence skiprows
    df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])

(Note I am using Python 2.7.13)

This is the exact solution that worked for me. I just tweaked it a bit for Python 3 by replacing StringIO with BytesIO from the io library and using requests for the download.

Python 3 version

    from io import BytesIO
    from zipfile import ZipFile
    import pandas
    import requests

    url = "https://www.nseindia.com/content/indices/mcwb_jun19.zip"
    content = requests.get(url)
    zf = ZipFile(BytesIO(content.content))

    for item in zf.namelist():
        print("File in zip: "+ item)

    # find the first matching csv file in the zip:
    match = [s for s in zf.namelist() if ".csv" in s][0]
    # the first line of the file contains a string - that line shall be ignored, hence skiprows
    df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
+1
Oct. 10 '17 at 21:35


