I have a program that saves its output to a tar.bz2 file as it works. I have a python script that processes this data.
I would like to be able to work with the output if the first program is interrupted, or to run the Python script against the file while the first process is still running.
Of course, the final bzip2 block is not complete, so it is impossible to read it: it is effectively corrupt, although in fact it is simply truncated. GNU tar will happily extract everything it can from the file up to that point, as will bzcat, for that matter. And bzip2recover can reconstruct the intact blocks, although in this case it is less useful than bzcat.
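To be concrete about the bzcat behaviour I mean: the same recovery can be reproduced in pure Python with bz2.BZ2Decompressor, which hands back every complete block it sees and simply never emits the partial final one (salvage_bz2 is just a name I made up for this sketch):

```python
import bz2

def salvage_bz2(path):
    """Decompress as much of a possibly-truncated .bz2 file as
    possible, like bzcat: every complete block is recovered and
    the incomplete final block is silently dropped."""
    decomp = bz2.BZ2Decompressor()
    out = bytearray()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            try:
                out += decomp.decompress(chunk)
            except (EOFError, OSError):
                # The tail is corrupt rather than merely truncated;
                # keep whatever decompressed cleanly up to here.
                break
    return bytes(out)
```

A truncated file produces no exception here at all: the decompressor just buffers the incomplete block internally and never returns its data.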
However, I want to use Python's standard tarfile module, and that fails with
```
File "/usr/lib64/python2.7/tarfile.py", line 2110, in extractfile
    tarinfo = self.getmember(member)
File "/usr/lib64/python2.7/tarfile.py", line 1792, in getmember
    tarinfo = self._getmember(name)
File "/usr/lib64/python2.7/tarfile.py", line 2361, in _getmember
    members = self.getmembers()
File "/usr/lib64/python2.7/tarfile.py", line 1803, in getmembers
    self._load()        # all members, we first have to
File "/usr/lib64/python2.7/tarfile.py", line 2384, in _load
    tarinfo = self.next()
File "/usr/lib64/python2.7/tarfile.py", line 2319, in next
    self.fileobj.seek(self.offset)
EOFError: compressed file ended before the logical end-of-stream was detected
```
when I try to use TarFile.extractfile on a member that I know is near the beginning of the archive. ( tar -xf tarfile.tar.bz2 filename extracts it just fine.)
Is there anything clever that I can do to ignore the invalid end of the file and work with what I have?
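Something like the following sketch is the kind of workaround I have in mind (the function name is mine, and this is untested against my real data): decompress whatever complete bzip2 blocks exist, as bzcat would, then walk the uncompressed tar stream member by member and stop at the ragged end instead of failing outright.

```python
import bz2
import io
import tarfile

def read_partial_tar_bz2(path):
    """Yield (name, bytes) for every member that survives intact in a
    truncated .tar.bz2, stopping at the damaged tail."""
    # Recover every complete bzip2 block (the incomplete final block
    # is simply never emitted by the decompressor).
    decomp = bz2.BZ2Decompressor()
    raw = bytearray()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            raw += decomp.decompress(chunk)
    # Walk the now-uncompressed tar stream until it runs out.
    try:
        tar = tarfile.open(fileobj=io.BytesIO(bytes(raw)), mode='r:')
        for info in tar:
            if not info.isfile():
                continue
            try:
                content = tar.extractfile(info).read()
            except tarfile.ReadError:
                return  # this member's data is cut off
            yield info.name, content
    except tarfile.ReadError:
        return  # ragged header (or nothing at all) at the end
```

The point is to iterate members in order rather than calling getmember(), since getmember() forces tarfile to scan all the way to the (missing) end of the archive.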
The data set can get quite large and is very, very compressible, so keeping it around uncompressed is undesirable.
(I found the existing question Untar archive in Python with errors, but in that case the user was shelling out to tar via os.system.)