Creating a zip stream without using temporary files

I have a python method that needs to collect a lot of data from an API, format it in CSV, compress and return the result.

I was Googling, and every solution I can find either requires writing to a temp file, or storing the entire archive in memory.

Memory is definitely not an option, since I quickly get OOM. Writing to a temporary file has many problems associated with it (in this field only the log disk is used at the moment, a much longer time before downloading, problems with cleaning files, etc. Etc.). Not to mention that this is just nasty.

I am looking for a library that will allow me to do something like ...

C = Compressor(outputstream) C.BeginFile('Data.csv') for D in Api.StreamResults(): C.Write(D) C.CloseFile() C.Close() 

In other words, something that will write the output stream when writing data.

I managed to do this in .Net and PHP, but I have no idea how to approach it in Python.

To imagine things in perspective, by "lots" of data, I mean that I need to be able to process up to ~ 10 GB (unprocessed plaintext) of data. This is part of the export / dump process for a large data system.

+6
source share
2 answers

As stated in the gzip documentation, you can pass a file-like object to the GzipFile constructor. Since python utilities, you can implement your own thread, for example:

 import sys from gzip import GzipFile class MyStream(object): def write(self, data): #write to your stream... sys.stdout.write(data) #stdout, for example gz= GzipFile( fileobj=MyStream(), mode='w' ) gz.write("something") 
+6
source

@goncaplopp the answer is great, but you can achieve more parallelism if you run gzip from the outside. Since you collect a lot of data, it can be worth the extra effort. You will need to find your own compression procedure for Windows (there are several gzip implementations, but what works 7z may work). You can also experiment with things like lz that compress more than gzip, depending on what else needs to be optimized on your system.

 import subprocess as subp import os class GZipWriter(object): def __init__(self, filename): self.filename = filename self.fp = None def __enter__(self): self.fp = open(self.filename, 'wb') self.proc = subp.Popen(['gzip'], stdin=subp.PIPE, stdout=self.fp) return self def __exit__(self, type, value, traceback): self.close() if type: os.remove(self.filename) def close(self): if self.fp: self.fp.close() self.fp = None def write(self, data): self.proc.stdin.write(data) with GZipWriter('sometempfile') as gz: for i in range(10): gz.write('a'*80+'\n') 
+5
source

Source: https://habr.com/ru/post/955026/


All Articles