I have a Python utility that opens a tar.xz file and processes each of the files inside it. The archive is 15 MB compressed and contains 740 MB of uncompressed data.
On one specific server with very limited memory, the program crashes because it runs out of memory. I used objgraph to see which objects are being created, and it turns out that TarInfo instances are never freed. The main loop is similar to this:
    import objgraph
    import tarfile

    with tarfile.open(...) as tar:
        iterations = 0
        while True:
            member = tar.next()               # renamed from "next" to avoid shadowing the builtin
            if member is None:                # no more entries in the archive
                break
            stream = tar.extractfile(member)
            process_stream(stream)            # my processing code
            iterations += 1
            if not iterations % 1000:
                objgraph.show_growth(limit=10)
The output is very consistent:
    TarInfo     2040    +1000
    TarInfo     3040    +1000
    TarInfo     4040    +1000
    TarInfo     5040    +1000
    TarInfo     6040    +1000
    TarInfo     7040    +1000
    TarInfo     8040    +1000
    TarInfo     9040    +1000
    TarInfo    10040    +1000
    TarInfo    11040    +1000
    TarInfo    12040    +1000
This continues until all 30,000 files have been processed.
Just to make sure, I commented out the lines that create and process the stream. Memory usage did not change: TarInfo instances still leaked.
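For reference, this is roughly what the stripped-down control loop looked like (a sketch; "archive.tar.xz" is a placeholder for my real path, which I have elided):

    import objgraph
    import tarfile

    # Same loop as above, but without extracting or processing anything.
    with tarfile.open("archive.tar.xz") as tar:   # placeholder path
        iterations = 0
        while True:
            member = tar.next()                   # only walk the archive metadata
            if member is None:
                break
            iterations += 1
            if not iterations % 1000:
                objgraph.show_growth(limit=10)    # TarInfo still grows by +1000

Even with nothing extracted, objgraph reports the same +1000 TarInfo growth every 1000 files.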
I am using Python 3.4.1, and the behavior is identical on Ubuntu, OS X, and Windows.