Compressing a large number of data blocks quickly and efficiently in C#

I have about 270k data pairs; each pair consists of one 32KiB block and one 16KiB block.

When I save them to a single file, of course I get a very large file. But the data is easily compressed.
After compressing the 5.48GiB file using WinRAR with strong compression, the resulting file size is 37.4 MB.

But I need random access to each individual block, so I can only compress the blocks separately.
For this I used the DeflateStream class provided by .NET, which reduced the file size to 382MiB (which I could live with).
But the speed is not good enough.

Most of the speed loss is probably due to creating a new MemoryStream and DeflateStream instance for each block, but they do not seem to be designed for reuse.
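Roughly, the per-block compression described above looks like the following (a minimal sketch of the approach with my own method names, not the actual code):

    using System.IO;
    using System.IO.Compression;

    static class BlockCompressor
    {
        // One MemoryStream and one DeflateStream are created per call,
        // which is the allocation pattern suspected of costing speed.
        public static byte[] CompressBlock(byte[] block)
        {
            using (var output = new MemoryStream())
            {
                using (var deflate = new DeflateStream(output, CompressionMode.Compress, true)) // true = leave output open
                {
                    deflate.Write(block, 0, block.Length);
                }
                return output.ToArray();
            }
        }

        // Blocks have a known fixed size (32KiB or 16KiB), so the caller
        // passes the original length instead of storing it.
        public static byte[] DecompressBlock(byte[] compressed, int originalLength)
        {
            using (var input = new MemoryStream(compressed))
            using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
            {
                var result = new byte[originalLength];
                int offset = 0, read;
                while (offset < originalLength &&
                       (read = deflate.Read(result, offset, originalLength - offset)) > 0)
                {
                    offset += read;
                }
                return result;
            }
        }
    }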

And I think (much?) better compression could be achieved if a "global" dictionary were used instead of one per block.
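The framework's DeflateStream has no way to supply a preset dictionary, but the Deflater/Inflater classes in SharpZipLib (a zlib port) do, so that is one way to experiment with a shared dictionary. A hedged sketch, assuming the SharpZipLib API; the contents of sharedDictionary (up to 32KiB of bytes representative of all blocks) are up to you:

    using System.IO;
    using ICSharpCode.SharpZipLib.Zip.Compression;

    static class DictionaryCompressor
    {
        public static byte[] Compress(byte[] block, byte[] sharedDictionary)
        {
            // false = keep the zlib header/footer; a preset dictionary needs it.
            var deflater = new Deflater(Deflater.BEST_COMPRESSION, false);
            deflater.SetDictionary(sharedDictionary);
            deflater.SetInput(block);
            deflater.Finish();

            var buffer = new byte[block.Length];
            using (var output = new MemoryStream())
            {
                while (!deflater.IsFinished)
                {
                    int count = deflater.Deflate(buffer);
                    output.Write(buffer, 0, count);
                }
                return output.ToArray();
            }
        }

        public static byte[] Decompress(byte[] compressed, byte[] sharedDictionary, int originalLength)
        {
            var inflater = new Inflater();
            inflater.SetInput(compressed);

            var result = new byte[originalLength];
            int offset = 0;
            while (offset < originalLength && !inflater.IsFinished)
            {
                if (inflater.IsNeedingDictionary)
                {
                    // zlib signals here that the stream was made with a preset dictionary.
                    inflater.SetDictionary(sharedDictionary);
                    continue;
                }
                offset += inflater.Inflate(result, offset, originalLength - offset);
            }
            return result;
        }
    }

Both sides must use exactly the same dictionary bytes, and how much it helps depends on how similar the blocks are to the dictionary.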

Is there an implementation of a compression algorithm (preferably in C#) that is suitable for this task?

The following link shows the percentage with which each byte value occurs, split into three types of blocks (32KiB blocks only). Blocks of the first and third type make up 37.5% each, the second type 25%: block percentages

To make a long story short: Type 1 consists mainly of ones, Type 2 mainly of zeros and ones, and Type 3 mainly of zeros. Values greater than 128 do not occur (yet).

The 16KiB blocks almost always consist of zeros.

+4
3 answers

If you want to try a different kind of compression, you could start with RLE, which should suit your data well - http://en.wikipedia.org/wiki/Run-length_encoding - and it is incredibly fast even in its simplest implementation. The related http://en.wikipedia.org/wiki/Category:Lossless_compression_algorithms contains links to other algorithms if you want to roll your own or find an existing implementation.
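A minimal sketch of a naive byte-oriented RLE (a toy format of my own, not a standard one), which should already do very well on the almost-all-zero 16KiB blocks:

    using System.Collections.Generic;

    static class Rle
    {
        // Encodes the data as (value, runLength) pairs, capping runs at 255
        // so the length fits into a single byte.
        public static byte[] Encode(byte[] data)
        {
            var output = new List<byte>();
            int i = 0;
            while (i < data.Length)
            {
                byte value = data[i];
                int run = 1;
                while (run < 255 && i + run < data.Length && data[i + run] == value)
                    run++;
                output.Add(value);
                output.Add((byte)run);
                i += run;
            }
            return output.ToArray();
        }

        public static byte[] Decode(byte[] encoded, int originalLength)
        {
            var result = new byte[originalLength];
            int pos = 0;
            for (int i = 0; i < encoded.Length; i += 2)
            {
                byte value = encoded[i];
                int run = encoded[i + 1];
                for (int j = 0; j < run; j++)
                    result[pos++] = value;
            }
            return result;
        }
    }

An all-zero 16KiB block comes out at roughly 130 bytes with this scheme; blocks with little repetition can grow to twice their size, so you may want to fall back to storing such blocks uncompressed.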

Random comment: "most of the speed loss is probably due to..." is not the way to solve a performance problem. Measure it and see whether it is actually true.
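For instance, a quick Stopwatch measurement over all blocks (a sketch; pass in whichever per-block compressor you are testing):

    using System;
    using System.Diagnostics;

    static class CompressionTimer
    {
        // Times an arbitrary per-block compressor over all blocks and
        // reports elapsed time and total compressed size.
        public static void Measure(byte[][] blocks, Func<byte[], byte[]> compress)
        {
            var watch = Stopwatch.StartNew();
            long compressedBytes = 0;
            foreach (var block in blocks)
                compressedBytes += compress(block).Length;
            watch.Stop();

            Console.WriteLine("{0} blocks in {1} ms, {2} compressed bytes",
                              blocks.Length, watch.ElapsedMilliseconds, compressedBytes);
        }
    }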

+5

Gzip is known to be "excellent", meaning its compression ratio is fine and its speed is good. If you want more compression, there are other alternatives, such as 7z.

If you want to increase speed, which seems to be your goal, a faster alternative will give you a significant speed advantage at the cost of some compression efficiency. "Significant" should translate to many times faster, for example 5x-10x. Such algorithms favour in-memory compression scenarios like yours, as they make accessing a compressed block almost painless.

As an example, Clayton Stangeland just released LZ4 for C#. The source code is available here under a BSD license: https://github.com/stangelandcl/LZ4Sharp
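Usage would be along the following lines. Note that the factory and method names here are my assumptions about the LZ4Sharp API rather than verified signatures, so check the project's README for the exact calls:

    // NOTE: the class and method names below are assumed, not taken from the
    // LZ4Sharp documentation; treat them as placeholders.
    using LZ4Sharp;

    static class Lz4BlockCompressor
    {
        static readonly ILZ4Compressor Compressor = LZ4CompressorFactory.CreateNew();
        static readonly ILZ4Decompressor Decompressor = LZ4DecompressorFactory.CreateNew();

        public static byte[] Compress(byte[] block)
        {
            return Compressor.Compress(block);
        }

        public static byte[] Decompress(byte[] compressed)
        {
            return Decompressor.Decompress(compressed);
        }
    }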

There are some benchmark comparisons against gzip on the project's main page, such as:

    i5 memcpy                             1658 MB/s
    i5 LZ4     Compression                 270 MB/s   Decompression 1184 MB/s
    i5 LZ4 C#  Compression                 207 MB/s   Decompression  758 MB/s   Ratio 49%
    i5 LZ4 C#  whole corpus  Compression   267 MB/s   Decompression  838 MB/s   Ratio 47%
    i5 gzip    whole corpus  Compression    48 MB/s   Decompression  266 MB/s   Ratio 33%

Hope this helps.

+4

You cannot have random access into a Deflate stream, no matter how hard you try (unless you forgo the LZ77 part, but that is largely what makes the compression ratio so good right now - and even then there are tricky problems to work around). This is because one part of the compressed data is allowed to refer to a previous part up to 32KiB behind it, which may in turn refer to yet another part, and so on; you have to start decoding the stream from the very beginning to get at the data you want, even if you know exactly where it sits in the compressed stream (which, at the moment, you don't).

But what you can do is compress many (though not all) of the blocks together into a single stream. Then you get pretty good speed and compression, and you don't have to unpack all the blocks to get at the one you want - only the particular chunk your block is in. You need an additional index that keeps track of where each compressed chunk starts in the file, but that is fairly low overhead. Think of it as a compromise between compressing everything together (great for compression, terrible for random access) and compressing each block separately (great for random access, terrible for compression and speed).
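Here is a rough sketch of that scheme for the 32KiB blocks, with a fixed 64 blocks per chunk (the class, the chunk size, and all names are illustrative, not a drop-in implementation):

    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;

    class ChunkedBlockStore
    {
        const int BlockSize = 32 * 1024;   // every block has a fixed size
        const int BlocksPerChunk = 64;     // larger = better ratio, slower random access

        readonly List<long> chunkOffsets = new List<long>(); // file offset of each compressed chunk
        readonly Stream file;

        public ChunkedBlockStore(Stream file) { this.file = file; }

        // Compresses the blocks into the file, one DeflateStream per chunk
        // of BlocksPerChunk blocks, recording where each chunk starts.
        public void WriteAll(IList<byte[]> blocks)
        {
            for (int start = 0; start < blocks.Count; start += BlocksPerChunk)
            {
                chunkOffsets.Add(file.Position);
                using (var deflate = new DeflateStream(file, CompressionMode.Compress, true)) // leave file open
                {
                    for (int i = start; i < blocks.Count && i < start + BlocksPerChunk; i++)
                        deflate.Write(blocks[i], 0, BlockSize);
                }
            }
            // chunkOffsets is the index; persist it alongside the file.
        }

        // Random access: decompress only the chunk containing blockIndex,
        // skipping the blocks that precede it inside that chunk.
        public byte[] ReadBlock(int blockIndex)
        {
            int chunk = blockIndex / BlocksPerChunk;
            int indexInChunk = blockIndex % BlocksPerChunk;

            file.Position = chunkOffsets[chunk];
            using (var deflate = new DeflateStream(file, CompressionMode.Decompress, true))
            {
                var block = new byte[BlockSize];
                // Deflate output can only be read sequentially, so read and
                // discard the earlier blocks of the chunk before the target.
                for (int skip = 0; skip <= indexInChunk; skip++)
                    ReadExactly(deflate, block);
                return block;
            }
        }

        static void ReadExactly(Stream s, byte[] buffer)
        {
            int offset = 0;
            while (offset < buffer.Length)
            {
                int read = s.Read(buffer, offset, buffer.Length - offset);
                if (read <= 0) throw new EndOfStreamException();
                offset += read;
            }
        }
    }

With 64 blocks of 32KiB per chunk, a random read decompresses at most 2MiB instead of the whole file, and the compression ratio gets much closer to the compress-everything-together case than to the per-block one.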

+2

Source: https://habr.com/ru/post/1382010/

