A pbzip2 stream is nothing more than a concatenation of multiple bzip2 streams.
Shell usage example:

    bzip2 < /usr/share/dict/words > words_x_1.bz2
    cat words_x_1.bz2{,,,,,,,,,} > words_x_10.bz2
    time bzip2 -d < words_x_10.bz2 > /dev/null
    time pbzip2 -d < words_x_10.bz2 > /dev/null
I have never used the Python bz2 module, but it should be easy to close and reopen the stream in append ('a') mode after every so many bytes to get the same result. Note that if a BZ2File is created from an existing file object, closing the BZ2File will not close the underlying stream (which is what you need here).
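A minimal sketch of that idea, assuming Python 3 and an already-open binary file object as the destination. The function name `write_multistream` and the 4 MiB default chunk size are my own choices for illustration, not anything pbzip2 prescribes. Each pass through the loop opens a fresh BZ2File on the same underlying file object, so closing it ends one bzip2 stream and the next one is written right after it:

```python
import bz2
import io


def write_multistream(src, dst, chunk_size=4 * 1024 * 1024):
    """Compress src into dst as a concatenation of independent bzip2 streams.

    src and dst are binary file objects. Returns the number of streams written.
    chunk_size is a guess, not a measured optimum.
    """
    streams = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # Closing the BZ2File finishes the current bzip2 stream,
        # but leaves dst open so the next stream can follow it.
        with bz2.BZ2File(dst, "w") as bz:
            bz.write(chunk)
        streams += 1
    return streams


if __name__ == "__main__":
    payload = b"hello world\n" * 100_000
    out = io.BytesIO()
    n = write_multistream(io.BytesIO(payload), out, chunk_size=100_000)
    # bz2.decompress handles multiple concatenated streams.
    assert bz2.decompress(out.getvalue()) == payload
    print(n, "streams,", len(out.getvalue()), "compressed bytes")
```

The result is an ordinary .bz2 file as far as plain bzip2 is concerned; it is simply made of several streams back to back, like the `cat` example above.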
I have not measured what chunk size is optimal, but I would guess somewhere between 1 and 20 megabytes. It should definitely be larger than the bzip2 block size (900k).
Note also that if you record the compressed and uncompressed offsets of each fragment, you get fairly efficient random access. This is what dictzip does, although it is based on gzip.
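A rough sketch of such an offset index in Python 3, working on in-memory bytes to keep it short. The helper names `compress_indexed` and `read_range` are hypothetical, and this is not dictzip's actual format, only the same idea applied to concatenated bzip2 streams:

```python
import bz2
import io
from bisect import bisect_right


def compress_indexed(data, chunk_size):
    """Return (blob, index).

    blob is a concatenation of bzip2 streams, one per chunk.
    index[i] is (uncompressed_offset, compressed_offset) of fragment i.
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(data), chunk_size):
        index.append((start, out.tell()))
        out.write(bz2.compress(data[start:start + chunk_size]))
    return out.getvalue(), index


def read_range(blob, index, offset, length):
    """Read up to `length` uncompressed bytes starting at `offset`,
    decompressing only the fragments that overlap the request."""
    if not index or offset < 0 or length <= 0:
        return b""
    starts = [u for u, _ in index]
    i = bisect_right(starts, offset) - 1  # fragment containing `offset`
    skip = offset - index[i][0]           # bytes to drop inside that fragment
    result = bytearray()
    while len(result) < length and i < len(index):
        comp_start = index[i][1]
        comp_end = index[i + 1][1] if i + 1 < len(index) else len(blob)
        fragment = bz2.BZ2Decompressor().decompress(blob[comp_start:comp_end])
        result += fragment[skip:skip + length - len(result)]
        skip = 0
        i += 1
    return bytes(result)


if __name__ == "__main__":
    data = bytes(range(256)) * 4000
    blob, index = compress_indexed(data, 100_000)
    assert read_range(blob, index, 250_000, 300_000) == data[250_000:550_000]
```

With a real file you would store the index alongside the data and `seek()` to the compressed offset instead of slicing a bytes object. Smaller fragments make random reads cheaper but cost some compression ratio.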