How can I read from a damaged tar.bz2 file in Python?

I have a program that writes its output to a tar.bz2 file as it runs, and a Python script that processes this data.

I would like to be able to work with the output if the first program is interrupted, or to run the Python script against the file while the first process is still running.

Of course, the final bzip2 block is incomplete, so it cannot be read; the file is effectively damaged, although really it is just truncated. GNU tar will happily extract everything it can from the file up to that point, as will bzcat, for that matter. bzip2recover can also reconstruct the recoverable blocks, although in this case it is less useful than bzcat.

But I am trying to use Python's standard tarfile module. It fails with:

  File "/usr/lib64/python2.7/tarfile.py", line 2110, in extractfile
    tarinfo = self.getmember(member)
  File "/usr/lib64/python2.7/tarfile.py", line 1792, in getmember
    tarinfo = self._getmember(name)
  File "/usr/lib64/python2.7/tarfile.py", line 2361, in _getmember
    members = self.getmembers()
  File "/usr/lib64/python2.7/tarfile.py", line 1803, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib64/python2.7/tarfile.py", line 2384, in _load
    tarinfo = self.next()
  File "/usr/lib64/python2.7/tarfile.py", line 2319, in next
    self.fileobj.seek(self.offset)
EOFError: compressed file ended before the logical end-of-stream was detected

when I try to use TarFile.extractfile on a member that I know is intact near the start of the archive. ( tar -xf tarfile.tar.bz2 filename would extract it just fine.)

Is there anything clever that I can do to ignore the invalid end of the file and work with what I have?

The data set can get quite large and it is very compressible, so keeping it uncompressed is undesirable.

(I found the existing question Untar Archive in Python with errors, but in that case the user is shelling out to tar via os.system.)

1 answer

There seem to be two possibilities. First, and most likely, the ignore_zeros option of tarfile.open(); from the documentation:

If ignore_zeros is False, treat an empty block as the end of the archive. If it is True, skip empty (and invalid) blocks and try to get as many members as possible. This is only useful for reading concatenated or damaged archives.
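A minimal sketch of the first option (the path and the helper function are my own, not from the question; a truncated bzip2 stream still raises an error partway through, so the loop simply keeps whatever was readable before the damage):

```python
import tarfile

def list_members(path):
    """List member names from a possibly truncated tar.bz2,
    keeping whatever is readable before the damage."""
    names = []
    try:
        # ignore_zeros=True: skip empty/invalid blocks instead of
        # treating the first one as the end of the archive.
        with tarfile.open(path, mode="r:bz2", ignore_zeros=True) as tf:
            for member in tf:
                names.append(member.name)
    except (EOFError, tarfile.ReadError):
        pass  # the compressed stream ended early; keep what we got
    return names
```

Depending on the Python version, the truncation surfaces as EOFError (from the bz2 layer) or as tarfile.ReadError, so both are caught.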

Second, the stream mode:

For special purposes, there is a second format for mode: 'filemode|[compression]'. tarfile.open() will return a TarFile object that processes its data as a stream of blocks. No random seeking will be done on the file. If given, fileobj may be any object that has a read() or write() method (depending on the mode). bufsize specifies the blocksize and defaults to 20 * 512 bytes. Use this variant in combination with e.g. sys.stdin, a socket file object or a tape device. However, such a TarFile object is limited in that it does not allow random access.

Accessing the file as a stream seems likely to help when the file is incomplete, since in that mode tarfile reads strictly sequentially and never tries to seek past the damage.
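A sketch of the stream-mode variant (again, the path, member name and helper are illustrative assumptions):

```python
import tarfile

def read_member(path, wanted):
    """Return the contents of one member from a possibly truncated
    tar.bz2, reading the archive strictly sequentially."""
    try:
        # 'r|bz2' is stream mode: tarfile reads block by block and
        # never seeks, so members before the damage stay reachable.
        with tarfile.open(path, mode="r|bz2") as tf:
            for member in tf:
                if member.name == wanted:
                    extracted = tf.extractfile(member)
                    if extracted is not None:
                        return extracted.read()
    except (EOFError, tarfile.ReadError):
        pass  # hit the truncated tail before finding the member
    return None
```

Note that in stream mode a member can only be read while it is the current one, so extractfile() has to be called inside the loop, as above.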

