Is it possible to parallelize bz2 decompression?

Question

I am using Python's bz2 module to create (and compress) a large jsonl file (17z bzip2 compression).

However, when I later try to unpack it using pbzip2, it seems that only one processor core is used for decompression, which is rather slow.

When I compress the file with pbzip2 instead, decompression can use multiple cores. Is there any way to compress in Python in a pbzip2-compatible format?

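    import bz2
    import sys
    import traceback  # needed for traceback.print_exc() below
    from Queue import Empty
    #...
    compressor = bz2.BZ2Compressor(9)
    f = open(path, 'ab')  # binary append: the compressor emits raw bytes
    try:
        while 1:
            # wait up to 60 seconds for the next message
            m = queue.get(True, 1*60)
            f.write(compressor.compress(m+"\n"))
    except Empty:
        # no message within the timeout: stop writing
        pass
    except Exception as e:
        traceback.print_exc()
    finally:
        sys.stderr.write("flushing")
        f.write(compressor.flush())
        f.close()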

python python-2.7 multiprocessing bzip2 bzip

worenga Sep 19 '17 at 12:47

1 answer

o11c · Accepted Answer · 2017-09-19T19:25:10+0000

A pbzip2 stream is nothing more than a concatenation of multiple bzip2 streams.

Shell Usage Example:

    bzip2 < /usr/share/dict/words > words_x_1.bz2
    cat words_x_1.bz2{,,,,,,,,,} > words_x_10.bz2
    time bzip2 -d < words_x_10.bz2 > /dev/null
    time pbzip2 -d < words_x_10.bz2 > /dev/null
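The same concatenation property can be checked from Python. This is only a minimal sketch, assuming Python 3.3 or later (where bz2.decompress accepts multi-stream input); the sample byte strings are made up for illustration.

```python
import bz2

# Two independently compressed bzip2 streams, joined byte-for-byte.
first = bz2.compress(b"alpha\n")
second = bz2.compress(b"beta\n")
combined = first + second

# On Python 3.3+, bz2.decompress reads all concatenated streams in order.
restored = bz2.decompress(combined)
print(restored == b"alpha\nbeta\n")
```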

I have never used the Python bz2 module, but it should be easy to close and reopen the stream in 'a'ppend mode every so many bytes to get the same result. Note that if a BZ2File is constructed from an existing file object, closing the BZ2File will not close the underlying stream (which is what you want here).

I have not measured what chunk size is optimal, but I would guess somewhere between 1 and 20 megabytes; it should definitely be larger than the bzip2 block size (900k).
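One way to apply this idea with BZ2Compressor instead of reopening the file is to finish the current stream and start a new compressor every so many input bytes. The following is a rough sketch, not a tuned implementation: it assumes Python 3, the name write_multistream is hypothetical, and the 4 MB default is an unmeasured guess inside the 1-20 MB range mentioned above.

```python
import bz2
import io

CHUNK_BYTES = 4 * 1024 * 1024  # assumed chunk size, not measured


def write_multistream(lines, out, chunk_bytes=CHUNK_BYTES):
    """Write an iterable of bytes lines to `out` as concatenated bzip2 streams.

    Returns the number of streams written (hypothetical helper).
    """
    compressor = bz2.BZ2Compressor(9)
    pending = 0   # uncompressed bytes fed to the current stream
    streams = 0
    for line in lines:
        out.write(compressor.compress(line))
        pending += len(line)
        if pending >= chunk_bytes:
            # Finish this stream and start a fresh, independent one.
            out.write(compressor.flush())
            compressor = bz2.BZ2Compressor(9)
            pending = 0
            streams += 1
    if pending:
        # Close the last, partially filled stream.
        out.write(compressor.flush())
        streams += 1
    return streams


# Small demonstration with a tiny chunk size so several streams are produced.
demo_lines = [b"x" * 100 + b"\n"] * 50
demo_out = io.BytesIO()
demo_streams = write_multistream(demo_lines, demo_out, chunk_bytes=1000)
```

In the question's loop, the same rotation would happen around the f.write(compressor.compress(...)) call. Each stream is compressed independently, so very small chunks cost some compression ratio.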

Also note that if you record the compressed and uncompressed offsets of each chunk, you get fairly efficient random access. This is how dictzip works, although it is based on gzip.
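The offset idea can be sketched as follows. This is an illustration under assumptions, not dictzip's actual format: it assumes Python 3, that the output starts at position 0, and the names write_indexed and read_chunk are hypothetical.

```python
import bz2
import io


def write_indexed(chunks, out):
    """Write each chunk as its own bzip2 stream.

    Returns a list of (raw_offset, comp_offset, comp_len) tuples, one per chunk.
    Assumes `out` is positioned at offset 0 when writing starts.
    """
    index = []
    raw_offset = 0
    comp_offset = 0
    for chunk in chunks:
        blob = bz2.compress(chunk)
        out.write(blob)
        index.append((raw_offset, comp_offset, len(blob)))
        raw_offset += len(chunk)
        comp_offset += len(blob)
    return index


def read_chunk(src, index, i):
    """Decompress only the i-th stream, using the recorded offsets."""
    _, comp_offset, comp_len = index[i]
    src.seek(comp_offset)
    return bz2.decompress(src.read(comp_len))


# Small demonstration: fetch the middle chunk without touching the others.
demo_chunks = [b"a" * 10, b"b" * 20, b"c" * 5]
demo_file = io.BytesIO()
demo_index = write_indexed(demo_chunks, demo_file)
middle = read_chunk(demo_file, demo_index, 1)
```

To find the chunk holding a given uncompressed position, one could search the raw_offset column of the index, for example with the bisect module.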
