I have about 270k data pairs, each consisting of one 32KiB block and one 16KiB block.
When I save them to a single file, of course I get a very large file. But the data is easily compressed.
After compressing the 5.48GiB file using WinRAR with strong compression, the resulting file size is 37.4 MB.
But I need random access to each individual block, so I can only compress the blocks separately.
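To make the random-access requirement concrete, this is roughly the layout I have in mind: an index with one offset/length entry per compressed block, so a single block can be located, read and decompressed on its own (an illustrative sketch only; the names are placeholders):

    // One entry per compressed block; the index itself is kept uncompressed.
    struct BlockIndexEntry
    {
        public long Offset;          // position of the compressed block in the data file
        public int CompressedLength; // number of bytes to read before decompressing
    }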
For this, I used the DeflateStream class provided by .NET, which reduced the file size to 382MiB (which I could live with).
But the speed is not good enough.
Most of the speed loss is probably due to creating a new MemoryStream and DeflateStream instance for every block, but it seems they are not designed to be reused.
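This is roughly what the per-block compression looks like at the moment (a simplified sketch, not my exact code):

    using System.IO;
    using System.IO.Compression;

    static byte[] CompressBlock(byte[] block)
    {
        // A fresh MemoryStream and DeflateStream per block - this is the part
        // I suspect is costing most of the time.
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
            {
                deflate.Write(block, 0, block.Length);
            }
            return output.ToArray();
        }
    }

    static byte[] DecompressBlock(byte[] compressed, int originalSize)
    {
        var result = new byte[originalSize];
        using (var input = new MemoryStream(compressed))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        {
            int offset = 0, read;
            while (offset < originalSize &&
                   (read = deflate.Read(result, offset, originalSize - offset)) > 0)
            {
                offset += read;
            }
        }
        return result;
    }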
And I think (much?) better compression could be achieved by using a "global" dictionary instead of a separate one for each block.
Is there an implementation of a compression algorithm (preferably in C#) that is suited to this task?
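For example, something along these lines is what I have in mind. The sketch below uses SharpZipLib (a third-party library, so this is an assumption on my part), whose Deflater/Inflater expose zlib's preset-dictionary feature; the dictionary would be up to 32KiB of data that is typical for the blocks and must be identical on both the compression and decompression side:

    using System.IO;
    using ICSharpCode.SharpZipLib.Zip.Compression;

    static byte[] CompressWithDictionary(byte[] block, byte[] dictionary)
    {
        var deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.SetDictionary(dictionary); // must be set before any input is compressed
        deflater.SetInput(block);
        deflater.Finish();

        var buffer = new byte[4096]; // working buffer; output is collected in the stream
        using (var output = new MemoryStream())
        {
            while (!deflater.IsFinished)
            {
                int count = deflater.Deflate(buffer);
                output.Write(buffer, 0, count);
            }
            return output.ToArray();
        }
    }

    static byte[] DecompressWithDictionary(byte[] compressed, byte[] dictionary, int originalSize)
    {
        var inflater = new Inflater();
        inflater.SetInput(compressed);

        var result = new byte[originalSize];
        int offset = 0;
        while (offset < originalSize && !inflater.IsFinished)
        {
            int count = inflater.Inflate(result, offset, originalSize - offset);
            if (count == 0 && inflater.IsNeedingDictionary)
            {
                // The zlib header signals that a preset dictionary is required.
                inflater.SetDictionary(dictionary);
            }
            offset += count;
        }
        return result;
    }

If I understand correctly, the Deflater/Inflater instances could also be reused across blocks via Reset(), which would help with the per-block allocation overhead as well.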
The following link shows the percentage at which each byte value occurs, broken down into three types of blocks (32KiB blocks only). The first and third block types each account for 37.5% of the blocks, the second for 25%: block percentages
In short: Type 1 consists mainly of a mix of these byte values, Type 2 consists mainly of zeros and ones, and Type 3 consists mainly of zeros. Values greater than 128 do not occur (yet).
The 16KiB blocks almost always consist of zeros.