TarInfo Object Leak

I have a Python utility that processes a tar.xz archive, handling each of the individual files inside it. The archive is a 15 MB compressed file containing 740 MB of uncompressed data.

On one specific server with very limited memory, the program crashes due to lack of memory. I used objgraph to see which objects are being created; it turns out that TarInfo instances are never freed. The main loop is similar to this:

    with tarfile.open(...) as tar:
        while True:
            next = tar.next()
            stream = tar.extractfile(next)
            process_stream()
            iter += 1
            if not iter % 1000:
                objgraph.show_growth(limit=10)

The output is very consistent:

    TarInfo     2040    +1000
    TarInfo     3040    +1000
    TarInfo     4040    +1000
    TarInfo     5040    +1000
    TarInfo     6040    +1000
    TarInfo     7040    +1000
    TarInfo     8040    +1000
    TarInfo     9040    +1000
    TarInfo    10040    +1000
    TarInfo    11040    +1000
    TarInfo    12040    +1000

This continues until all 30,000 files have been processed.

Just to make sure, I commented out the lines that create the stream and process it. Memory usage did not change: the TarInfo instances still leaked.
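The growth can be reproduced with the standard library alone, without objgraph. The sketch below is an assumption-laden stand-in: it builds a small tar archive in memory (instead of the real 15 MB tar.xz) and counts live TarInfo instances via gc.get_objects() before and after iterating.

```python
import gc
import io
import tarfile

def count_tarinfo():
    # Count live TarInfo instances tracked by the garbage collector.
    return sum(1 for o in gc.get_objects() if isinstance(o, tarfile.TarInfo))

# Build a small in-memory archive (a stand-in for the real tar.xz).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as writer:
    for i in range(100):
        payload = b"x" * 10
        info = tarfile.TarInfo(name="file%d" % i)
        info.size = len(payload)
        writer.addfile(info, io.BytesIO(payload))
del writer  # drop the writer and the TarInfo objects it accumulated

buf.seek(0)
before = count_tarinfo()
with tarfile.open(fileobj=buf, mode="r") as tar:
    while True:
        member = tar.next()
        if member is None:
            break
        tar.extractfile(member).read()
    grown = count_tarinfo() - before
print(grown)  # roughly one surviving TarInfo per archive member
```

Even after every member has been read and processed, all of the TarInfo objects are still alive.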

I am using Python 3.4.1, and the behavior is the same on Ubuntu, OS X, and Windows.

1 answer

It seems this is actually by design. The TarFile object maintains a list of all the TarInfo objects it has seen in its members attribute. Each time you call next(), the TarInfo object extracted from the archive is appended to that list:

    def next(self):
        """Return the next member of the archive as a TarInfo object, when
        TarFile is opened for reading. Return None if there is no more
        available.
        """
        self._check("ra")
        if self.firstmember is not None:
            m = self.firstmember
            self.firstmember = None
            return m

        # Read the next block.
        self.fileobj.seek(self.offset)
        tarinfo = None
        ...  # <snip>
        if tarinfo is not None:
            self.members.append(tarinfo)  # <-- the TarInfo instance is added to members
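The effect is easy to observe directly: len(tar.members) grows by one with every successful next() call. A minimal sketch, again using a small in-memory archive rather than a real file on disk:

```python
import io
import tarfile

# Build a tiny in-memory archive with three members.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as writer:
    for i in range(3):
        payload = b"hello"
        info = tarfile.TarInfo(name="file%d.txt" % i)
        info.size = len(payload)
        writer.addfile(info, io.BytesIO(payload))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    counts = []
    while tar.next() is not None:
        counts.append(len(tar.members))  # list grows with every next() call
print(counts)  # [1, 2, 3]
```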

The members list will continue to grow as you retrieve more items. This is what makes getmembers() and getmember() work, but for your use case it is just a nuisance. The simplest solution is to clear the members attribute on each iteration (as suggested here):

    with tarfile.open(...) as tar:
        while True:
            next = tar.next()
            stream = tar.extractfile(next)
            process_stream()
            iter += 1
            tar.members = []  # Clear members list
            if not iter % 1000:
                objgraph.show_growth(limit=10)
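A self-contained version of this fix, runnable as-is against a synthetic in-memory archive (the file count and sizes are placeholders, not the asker's real data), shows that iteration still visits every member while the members list stays empty. next() navigates by byte offset, so clearing the list does not affect sequential reading:

```python
import io
import tarfile

# Build a synthetic archive with 1000 small members.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as writer:
    for i in range(1000):
        payload = b"x" * 32
        info = tarfile.TarInfo(name="file%d" % i)
        info.size = len(payload)
        writer.addfile(info, io.BytesIO(payload))

buf.seek(0)
processed = 0
with tarfile.open(fileobj=buf, mode="r") as tar:
    while True:
        member = tar.next()
        if member is None:
            break
        tar.extractfile(member).read()  # stand-in for process_stream()
        processed += 1
        tar.members = []  # forget the member we just processed
print(processed, len(tar.members))  # 1000 0
```

Note that getmembers() and getmember() will no longer work on this TarFile afterwards, which is the trade-off being made here.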

Source: https://habr.com/ru/post/976788/

