Decompression with BZ2_bzDecompress is slower than the bzip2 command

I use mmap/read + BZ2_bzDecompress to sequentially decompress a large file (29 GB). I need to parse the uncompressed XML data, but only small pieces of it, so it seemed more efficient to do this sequentially than to decompress the entire file (400 GB uncompressed) and then parse it. Interestingly, the decompression part is very slow: while the bzip2 shell command manages over 52 MB per second (measured with several timeout 10 bzip2 -c -k -d input.bz2 > output runs, dividing the size of the produced files by 10), my program cannot even reach 2 MB/s, slowing down to 1.2 MB/s after a few seconds.

The file I'm trying to process consists of several bz2 streams, so I check the return value of BZ2_bzDecompress for BZ_STREAM_END, and when it occurs I call BZ2_bzDecompressEnd( strm ); and BZ2_bzDecompressInit( strm, 0, 0 ) to restart with the next stream, if the file has not been fully processed yet. I also tried it without BZ2_bzDecompressEnd, but that changed nothing (and I cannot find in the documentation how multiple streams are supposed to be handled).

The file is mmap'ed beforehand, where I also tried different combinations of flags, currently MAP_RDONLY, MAP_PRIVATE, with madvise set to MADV_SEQUENTIAL | MADV_WILLNEED | MADV_HUGEPAGE (I check the return value, and madvise reports no problems; I'm on a Debian installation with Linux kernel 3.2x, which supports these flags).

Profiling convinced me that, apart from some counters for measuring and printing the speed (limited to once every n iterations), nothing else is being done. Also, this runs on a modern multi-core server CPU where all other cores are idle, and it is bare metal, not virtualized.

Any ideas what I might be doing wrong / what I can do to improve performance?

Update: thanks to James Chong's suggestion, I tried swapping mmap() for read(), and the speed is still the same. So mmap() does not seem to be the problem (either that, or mmap() and read() share the same underlying problem).

Update 2: Thinking that the malloc/free calls made by bzDecompressInit/bzDecompressEnd might be the cause, I set bzalloc/bzfree of the bz_stream structure to a custom implementation that allocates memory only the first time and does not free it unless a flag is set (passed via the opaque parameter = strm.opaque). It works fine, but again the speed did not increase.
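A minimal sketch of such a caching allocator (all names are mine; this variant parks freed blocks in a small slot table rather than using a flag, but the idea is the same). Each allocation is prefixed with a size header so that bzfree knows how large the block is and can cache it for the next bzDecompressInit:

```c
#include <stdlib.h>

/* Small cache of freed blocks, keyed by size, to spare libbz2 the
 * repeated malloc/free cycle across streams. */
#define CACHE_SLOTS 8
typedef struct {
    void  *ptr;
    size_t size;
} cache_slot;
typedef struct {
    cache_slot s[CACHE_SLOTS];
} alloc_cache;

static void *cached_bzalloc(void *opaque, int items, int size)
{
    alloc_cache *c = opaque;
    size_t need = (size_t)items * (size_t)size;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->s[i].ptr != NULL && c->s[i].size == need) {
            void *p = c->s[i].ptr;       /* reuse a cached block */
            c->s[i].ptr = NULL;
            return p;
        }
    }
    size_t *raw = malloc(sizeof(size_t) + need);
    if (raw == NULL)
        return NULL;
    *raw = need;                         /* remember the size in a header */
    return raw + 1;
}

static void cached_bzfree(void *opaque, void *addr)
{
    alloc_cache *c = opaque;
    if (addr == NULL)
        return;
    size_t *raw = (size_t *)addr - 1;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->s[i].ptr == NULL) {       /* park the block for reuse */
            c->s[i].ptr = addr;
            c->s[i].size = *raw;
            return;
        }
    }
    free(raw);
}
```

It is hooked up before BZ2_bzDecompressInit via `strm.bzalloc = cached_bzalloc; strm.bzfree = cached_bzfree; strm.opaque = &cache;`. As the update says, removing the allocator churn this way did not change the throughput.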

Update 3: I also tried fread() instead of read(), and the speed still stays the same. I also tried different numbers of bytes per read and different sizes for the decompressed-data buffer; no change.

Update 4: Read speed is definitely not the problem, since I managed to reach about 120 MB/s in sequential reading using mmap() alone.

1 answer

Swapping read() for mmap(), and the mmap flags, have little to do with it: if the bzip2 decompression is slow, file I/O is not the reason.

I suspect your libbz2 was not fully optimized. Recompile it with the most aggressive gcc optimization flags you can think of.

My second idea is some ELF overhead. In that case the problem should disappear if you link bz2 statically. (After that, you can think about how to make it fast with a dynamically loaded libbz2.)

An important addition from the future: libbz2 must be reentrant, thread-safe and position-independent. That means it is compiled with several extra C flags, and those flags do not do performance any good (although they make the code safer to use). In extreme cases I could even imagine a 5-10x slowdown compared to a single-threaded, non-PIC, non-reentrant build.


Source: https://habr.com/ru/post/1501655/
