Is there a name for this compression algorithm?

Let's say you have a four-byte integer, and you want to compress it to fewer bytes. You can compress it because smaller values ​​are more likely than larger values ​​(i.e., the likelihood of a value decreasing with its value). You apply the following scheme to get a result of 1, 2, 3, or 4 bytes:

Note that in the description below (bits are based on the same level and go from the most significant to the least significant), i.e. the first bit refers to the most significant bit, the second bit to the next most significant bit, etc. ..)

  • If n <128, you encode it as one byte with the first set of bits to zero
  • If n> = 128 and n <16.384, you use an integer of two bytes. You set the first bit to one to indicate the second bit is zero. Then you use the remaining 14 bits to encode the number n.
  • If n> 16, 384 and n <2,097,152, you use three bytes of an integer. You set the first bit to one, the second bit to one, and the third bit to zero. You use the remaining 21 bits to encode n.
  • If n> 2,097,152 and n <268,435,456, you are using a four-byte integer. You set the first three bits in one and the fourth bit to zero. You use the remaining 28 bits to encode n.
  • If n> = 268,435,456 and n <4,294,967,296, you use an integer of five bytes. You set the first four bits to one and use the next 32 bits to set the exact value of n, in the form of four bytes an integer. The rest of the bits are not used.

Is there a name for this algorithm?

+6
source share
5 answers

This is pretty close to variable length or base-128 size . The last name is due to the fact that each 7-bit block in your encoding can be considered a 128 base.

+3
source

it is very similar to Delugosz 'integer variable-length encoding

+3
source

Huffman coding refers to using fewer bits to store more general data in exchange for using more bits to store less common data.

+2
source

Your schema is similar to UTF-8, which is the encoding scheme used for Unicode text data.

The main difference is that each byte in the UTF-8 stream indicates whether it is the leading or the final byte, so the sequence can be read from the middle. With your circuit, the missing leading byte will make the rest of the file completely unreadable if a series of such values ​​is saved. And reading such a sequence should begin at the beginning, and not in an arbitrary place.

+1
source

Varint

Using the high bit of each byte to indicate “continue” or “stop”, and the remaining bits (7 bits of each byte in the sequence) are interpreted as ordinary binary code that encodes the actual value:

It sounds like “Base 128 Varint” used in “Google Protocol Buffers” .

related ways to compress integers

In short: this code is an integer in 2 parts: The first part in unary code, which indicates how many bits will be required to read the rest of the value, and the second part (of the specified width in bits) in more or less equal binary format, which encodes actual value.

This particular code is "streams" of a binary unary code, but other similar codes first pack the complete unary code and then the binary code, such as Elias gamma code .

I suspect this code is one of the “Start / Stop Codes” family, as described in:

Stephen Pidgeon - Start / Stop Codes - Procs. Data Compression Conference, 2001, IEEE Computer Society Press, 2001.

0
source

Source: https://habr.com/ru/post/885673/


All Articles