Practical random data compression

So, yesterday I asked a question about compressing a sequence of integers (link), and most of the comments made a similar point: if the ordering is random (or worse, the data is entirely random), then you need log2(k) bits per value in the range [0, k). I have also read similar answers in other questions on this site. Now, and I hope this is not a stupid question: if I take such a sequence, serialize it to a file, and then run gzip on that file, I do in fact achieve compression (and, depending on how long I let gzip run, I could presumably achieve high compression). Can anyone explain this?
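To make this concrete, here is a minimal sketch of the experiment I am describing (the length n, the range k, the seed, and the file name are illustrative assumptions, not exact details from my original question):

    // Serialize n pseudo-random integers in [0, k) as decimal text,
    // then run `gzip random.txt` from a shell.
    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <random>

    int main() {
        const std::size_t n = 1000000;   // assumed sequence length
        const std::uint32_t k = 1000;    // assumed value range [0, k)
        std::mt19937 gen(12345);         // assumed fixed seed
        std::uniform_int_distribution<std::uint32_t> dist(0, k - 1);

        std::ofstream out("random.txt");
        for (std::size_t i = 0; i < n; ++i)
            out << dist(gen) << '\n';    // one value per line
    }

gzip does shrink this file noticeably, which is what prompted the question.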

Thanks in advance.

+4
3 answers

I am guessing that you are getting compression on your random file because you are not using an optimal serialization technique, but without more details it is impossible to answer your question. Is the compressed file of n numbers in the range [0, k) smaller than n * log2(k) bits? (That is, n * log256(k) bytes.) If so, can gzip do that for all the random files you generate, or just occasionally?
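For reference, here is a quick computation of that bound, using the same assumed n and k as the sketch in the question (assumptions, not values the asker actually gave):

    // Information-theoretic floor for n uniform values in [0, k):
    // n * log2(k) bits, i.e. n * log256(k) bytes.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double n = 1000000.0;              // assumed count
        const double k = 1000.0;                 // assumed range
        const double bits = n * std::log2(k);    // ~9.97e6 bits
        std::printf("floor: %.0f bits (%.0f bytes)\n", bits, bits / 8.0);
    }

The decimal-text file from the question is roughly 3.9 MB (about 2.9 digits plus a newline per value), while the floor is about 1.25 MB; any compression gzip achieves between those two sizes is recovered serialization waste, not a violation of the bound.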

Let me note one thing: suppose you tell me, "I created a file of random octets using uniform_int_distribution(0, 255) with an mt19937 prng [1]. What is the optimal compression of my file?" Now my answer might reasonably be: "probably about 80 bits." All I need to reproduce your file is

  • the value you used to seed the prng, probably a 32-bit integer [2]; and

  • the length of the file, which probably fits in 48 bits.

And if I can reproduce the data file from 80 bits of data, that is optimal compression. Unfortunately, this is not a general-purpose compression strategy: it is highly unlikely that gzip will figure out that you used a particular prng to create the file, much less that it will be able to reverse-engineer the seed (although these things are at least theoretically achievable; the Mersenne twister is not a cryptographically secure prng).
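A minimal sketch of that 80-bit "decompressor" (the seed, length, and output path are made-up values for illustration):

    // Reproduce the entire "random" file from just a seed and a length.
    #include <cstdint>
    #include <fstream>
    #include <random>

    void reproduce(std::uint32_t seed, std::uint64_t length, const char* path) {
        std::mt19937 gen(seed);
        std::uniform_int_distribution<int> octet(0, 255);
        std::ofstream out(path, std::ios::binary);
        for (std::uint64_t i = 0; i < length; ++i)
            out.put(static_cast<char>(octet(gen)));
    }

    int main() {
        // 32-bit seed + 48-bit length: ~80 bits of "compressed" input.
        reproduce(12345u, 1u << 20, "octets.bin");
    }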

For another example: it is generally recommended that text be compressed before encryption, and the result will be quite a bit shorter than compressing after encryption. The point is that encryption adds very little entropy; at most, it adds the number of bits in the encryption key. Nevertheless, the result is difficult to distinguish from random data, and gzip will struggle to compress it (although it occasionally manages to squeeze out a few bits).
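A rough demonstration of that point, using zlib (the library behind gzip) and, as a deliberately crude stand-in for real encryption, an XOR with a mt19937 keystream; everything here is an assumption made purely for illustration:

    // Build with: g++ demo.cpp -lz
    #include <cstdio>
    #include <random>
    #include <string>
    #include <vector>
    #include <zlib.h>

    static unsigned long compressed_size(const std::string& data) {
        uLongf len = compressBound(data.size());
        std::vector<Bytef> buf(len);
        compress(buf.data(), &len,
                 reinterpret_cast<const Bytef*>(data.data()), data.size());
        return len;
    }

    int main() {
        std::string text(100000, 'a');     // highly redundant "text"

        std::mt19937 gen(42);              // the "key" is just this seed
        std::string enc = text;
        for (char& c : enc)
            c = static_cast<char>(c ^ (gen() & 0xFF));

        std::printf("plain:  %lu bytes compressed\n", compressed_size(text));
        std::printf("cipher: %lu bytes compressed\n", compressed_size(enc));
    }

The ciphertext barely shrinks at all, even though its true information content is only the plaintext's entropy plus the 32 bits of "key".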


Note 1: All C++11 / boost lingo. mt19937 is an implementation of the Mersenne twister pseudo-random number generator, which has a period of 2^19937 - 1.


Note 2: The state of the Mersenne twister is actually 624 words (19968 bits), but most programs use somewhat fewer bits for the seed. You may have used a 64-bit integer rather than a 32-bit one, but that does not change the answer much.

+4

If the data is truly random, then on average no compression algorithm can compress it. But if the data has predictable patterns (for example, if the probability of a symbol depends on the previous k symbols seen in the data), many prediction-based compression algorithms will succeed.
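As a sketch of what "predictable patterns" means in practice, here is a small order-1 model (an assumption: the simplest case, k = 1) that estimates how many bits per symbol a predictor conditioned on the previous symbol would need:

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Empirical conditional entropy H(X_i | X_{i-1}) in bits per symbol.
    double order1_entropy(const std::string& s) {
        std::vector<double> pair(256 * 256, 0.0), ctx(256, 0.0);
        for (std::size_t i = 1; i < s.size(); ++i) {
            unsigned char a = s[i - 1], b = s[i];
            pair[a * 256 + b] += 1.0;
            ctx[a] += 1.0;
        }
        double h = 0.0, total = double(s.size()) - 1.0;
        for (int a = 0; a < 256; ++a)
            for (int b = 0; b < 256; ++b) {
                double c = pair[a * 256 + b];
                if (c > 0.0)                     // P(a,b) * log2 P(b|a)
                    h -= (c / total) * std::log2(c / ctx[a]);
            }
        return h;
    }

    int main() {
        // Fully predictable from the previous symbol: ~0 bits/symbol,
        // even though both symbols are equally frequent overall.
        std::printf("%.3f bits/symbol\n", order1_entropy("abababababababab"));
    }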

+3

if I take such a sequence, serialize it to a file, and then run gzip on that file, I do in fact achieve compression

Says who? If you take random bytes (each uniformly distributed in 0..255) and feed them to gzip or any other compressor, you will in very rare cases get a small amount of compression, but most of the time you will get a small amount of expansion.
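A sketch of that experiment with zlib, the DEFLATE implementation behind gzip (the buffer size and seed are arbitrary choices):

    // Build with: g++ demo.cpp -lz
    #include <cstdio>
    #include <random>
    #include <vector>
    #include <zlib.h>

    int main() {
        std::mt19937 gen(std::random_device{}());
        std::uniform_int_distribution<int> octet(0, 255);

        std::vector<Bytef> src(1 << 20);          // 1 MiB of random bytes
        for (Bytef& b : src) b = static_cast<Bytef>(octet(gen));

        uLongf dlen = compressBound(src.size());
        std::vector<Bytef> dst(dlen);
        compress2(dst.data(), &dlen, src.data(), src.size(),
                  Z_BEST_COMPRESSION);            // maximum effort

        std::printf("in: %zu bytes, out: %lu bytes\n",
                    src.size(), (unsigned long)dlen);
    }

The output is almost always a little larger than the input, and letting the compressor work harder (higher levels, more time) does not help, because there are no patterns to find.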

+2

Source: https://habr.com/ru/post/1435547/

