Compressing a sequence of integers providing random access

I have a sequence of n integers in a small range [0, k), and all integers occur with the same frequency f (so the sequence length is n = f*k). Now I am trying to compress this sequence while providing random access (retrieving the i-th integer). The access time does not have to be O(1); I am more interested in achieving high compression, even at the cost of higher random access time.

I have not tried Huffman coding, since it assigns codes based on frequencies, and all my frequencies are the same. Perhaps I am missing a simple encoding for this particular case.

Any help or pointers would be appreciated.

Thanks in advance.

PS: I already asked this on cs.stackexchange, but I am asking here as well for better coverage, sorry.

+4
2 answers

If all your integers occur with the same frequency, then a good approximation of optimal compression is ceil(log2(k)) bits per integer. You can store them in a bit array with constant-time access.
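For illustration, a minimal Python sketch of this fixed-width packing (the class name FixedWidthArray is made up; a production version would pack into a bytearray or an array of machine words, where the shift-and-mask lookup is truly O(1)):

    class FixedWidthArray:
        def __init__(self, values, k):
            # ceil(log2(k)) bits per value; bit_length avoids float rounding.
            self.width = max(1, (k - 1).bit_length())
            self.mask = (1 << self.width) - 1
            self.bits = 0
            for i, v in enumerate(values):
                self.bits |= v << (i * self.width)

        def __getitem__(self, i):
            # Shift-and-mask lookup at a fixed bit offset.
            return (self.bits >> (i * self.width)) & self.mask

    a = FixedWidthArray([3, 0, 2, 1, 3, 2], k=4)
    assert a[2] == 2  # each value occupies exactly 2 bits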

If k is painfully small (e.g. 3), the above method can waste a noticeable amount of space. But you can combine a fixed number of your small integers into a single base-k number, which can be packed more efficiently into a fixed number of bits (and you can conveniently fit the result into a standard-sized machine word). Either way, this encoding can still be accessed in constant time.
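For example, with k = 3, plain bit packing needs 2 bits per value, but five base-3 digits fit in one byte (3^5 = 243 <= 256), i.e. 1.6 bits per value. A rough Python sketch of this grouping (the constants and function names are illustrative, chosen for k = 3):

    K, GROUP = 3, 5  # 3**5 = 243 fits in one byte

    def pack(values):
        out = bytearray()
        for g in range(0, len(values), GROUP):
            word = 0
            for v in reversed(values[g:g + GROUP]):
                word = word * K + v  # accumulate a base-K number
            out.append(word)
        return out

    def get(packed, i):
        word = packed[i // GROUP]   # O(1): the byte holding value i
        for _ in range(i % GROUP):  # peel off lower base-K digits
            word //= K
        return word % K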

If your integers do not all have the same frequency, optimal compression produces variable bit rates in different parts of your input, so simple array indexing no longer works. In that case you need an index structure for good random access performance: break the compressed data into blocks of convenient size, each of which can be decompressed sequentially, so the access time is bounded by the block size.
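A rough sketch of that block-plus-index idea, using zlib as a stand-in for whatever entropy coder you prefer (BLOCK and the one-byte-per-value assumption are illustrative):

    import zlib

    BLOCK = 4096  # trade-off: larger blocks compress better but decompress slower

    def compress_blocks(values):
        # Each block decompresses independently; assumes values fit in one byte.
        return [zlib.compress(bytes(values[i:i + BLOCK]))
                for i in range(0, len(values), BLOCK)]

    def access(blocks, i):
        # Decompress only the block containing position i.
        return zlib.decompress(blocks[i // BLOCK])[i % BLOCK]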


If the frequency of each number is exactly the same, you can save some space by taking advantage of that fact, but it may not be enough to be worthwhile.

The entropy of n random numbers in the range [0, k) is n log2(k), i.e. log2(k) bits per number; this is how many bits it takes to encode your numbers without exploiting the exact frequencies.

The entropy of the distinguishable permutations of f copies of each of the k elements (where n = f*k) is:

 log2( n!/(f!)^k ) = log2(n!) - k * log2(f!) 

Applying Stirling's approximation (which is accurate here only if n and f are large) gives:

 ~ n log2(n) - n log2(e) - k ( f log2(f) - f log2(e) )
 = n log2(n) - n log2(e) - n log2(f) + n log2(e)
 = n ( log2(n) - log2(f) )
 = n log2(n/f)
 = n log2(k)

This means that if n is large and k is small, you won't save a significant amount of space by using the exact frequencies of your input.

The overall error in the Stirling approximation above is O(log2(n) + k log2(f)), which is O(log2(n)/n + log2(f)/f) per encoded number. This means that if your k is so large that your f is small (i.e., each distinct number has only a few copies), you may be able to save some space with a clever encoding. However, the question indicates that k is actually small.
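For concreteness, a small Python check of the exact count log2(n!/(f!)^k) against the n log2(k) bound (with k = 16, f = 1 this reproduces the 44.25 vs. 64 bits mentioned in the answer below):

    import math

    # Exact bits for a multiset permutation vs. the n*log2(k) upper bound.
    def exact_bits(k, f):
        n = k * f
        return math.log2(math.factorial(n)) - k * math.log2(math.factorial(f))

    for k, f in [(16, 1), (4, 4), (4, 1000)]:
        n = k * f
        print(f"k={k} f={f}: exact={exact_bits(k, f):.1f}, bound={n * math.log2(k):.1f}")

As f grows with k fixed, the saving per number shrinks, matching the approximation above.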

+2

If you work out the number of possible arrangements and take its log base 2, you can find the best possible compression, and I don't think it will be that great in your case. With 16 numbers of frequency 1, the number of possible messages is 16!, and Excel tells me that log base 2 of 16! is 44.25, while storing them as 4-bit codes would take only 64 bits. (Where there is more than one of each type, you want the multinomial coefficient: http://mathworld.wolfram.com/MultinomialCoefficient.html )

I think you will have a problem combining this with random access, because the only information you have is that there is a fixed count of each type of element over the whole sequence. That is not much information for the whole sequence, and it says almost nothing about the first half of the sequence in isolation, because you could have more of some numbers in the first half and fewer in the second half.

+1
