Is there a name for this compression algorithm?

Question

Is there a name for this compression algorithm?

Let's say you have a four-byte integer, and you want to compress it to fewer bytes. You can compress it because smaller values are more likely than larger values (i.e., the likelihood of a value decreasing with its value). You apply the following scheme to get a result of 1, 2, 3, or 4 bytes:

Note that in the description below (bits are based on the same level and go from the most significant to the least significant), i.e. the first bit refers to the most significant bit, the second bit to the next most significant bit, etc. ..)

If n <128, you encode it as one byte with the first set of bits to zero
If n> = 128 and n <16.384, you use an integer of two bytes. You set the first bit to one to indicate the second bit is zero. Then you use the remaining 14 bits to encode the number n.
If n> 16, 384 and n <2,097,152, you use three bytes of an integer. You set the first bit to one, the second bit to one, and the third bit to zero. You use the remaining 21 bits to encode n.
If n> 2,097,152 and n <268,435,456, you are using a four-byte integer. You set the first three bits in one and the fourth bit to zero. You use the remaining 28 bits to encode n.
If n> = 268,435,456 and n <4,294,967,296, you use an integer of five bytes. You set the first four bits to one and use the next 32 bits to set the exact value of n, in the form of four bytes an integer. The rest of the bits are not used.

Is there a name for this algorithm?

+6

compression

Michael Goldshteyn Apr 11 '11 at 18:41

source share

5 answers

it is very similar to Delugosz 'integer variable-length encoding

+3

mohaps Apr 11 '11 at 18:48

source share

Huffman coding refers to using fewer bits to store more general data in exchange for using more bits to store less common data.

+2

ikegami Apr 11 '11 at 18:44

source share

Your schema is similar to UTF-8, which is the encoding scheme used for Unicode text data.

The main difference is that each byte in the UTF-8 stream indicates whether it is the leading or the final byte, so the sequence can be read from the middle. With your circuit, the missing leading byte will make the rest of the file completely unreadable if a series of such values is saved. And reading such a sequence should begin at the beginning, and not in an arbitrary place.

+1

Jeffrey l whitledge Apr 11 '11 at 18:51

source share

Varint

Using the high bit of each byte to indicate “continue” or “stop”, and the remaining bits (7 bits of each byte in the sequence) are interpreted as ordinary binary code that encodes the actual value:

It sounds like “Base 128 Varint” used in “Google Protocol Buffers” .

related ways to compress integers

In short: this code is an integer in 2 parts: The first part in unary code, which indicates how many bits will be required to read the rest of the value, and the second part (of the specified width in bits) in more or less equal binary format, which encodes actual value.

This particular code is "streams" of a binary unary code, but other similar codes first pack the complete unary code and then the binary code, such as Elias gamma code .

I suspect this code is one of the “Start / Stop Codes” family, as described in:

Stephen Pidgeon - Start / Stop Codes - Procs. Data Compression Conference, 2001, IEEE Computer Society Press, 2001.

0

David cary May 21 '11 at 4:16

source share

Fred foo · Accepted Answer · 2011-04-11T18:55:19+0000

This is pretty close to variable length or base-128 size . The last name is due to the fact that each 7-bit block in your encoding can be considered a 128 base.

Is there a name for this compression algorithm?

More articles: