Numeric String Compression

Can anyone suggest compression algorithms for working with numeric strings of 20-30 digits?

+4
7 answers

Assuming the strings can contain floating-point numbers, there are 11 possible characters:

[0,1,2,3,4,5,6,7,8,9, .]

This means you need 4 bits per character, since 3 bits can distinguish only 8 values. Packing each character into 4 bits instead of a full 8-bit byte halves the size of the string.

If your strings contain only integer digits, a simple solution is to convert the number to hexadecimal. You still use 4 bits per character but get a better compression ratio, since none of the 16 possible values goes unused.

If you use Huffman compression, you get the optimal bits-per-character ratio for a code that assigns each character its own bit pattern, given the character frequencies. You can learn more about Huffman compression here.
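As a rough illustration of the Huffman idea, here is a minimal sketch in Python. The function names (`huffman_code`, `encode`, `decode`) are made up for this example, and the bits are kept as a string of "0"/"1" characters for readability rather than packed into bytes.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code {symbol: bitstring} from the symbol counts in text."""
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, partial_code_table).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

def encode(text, code):
    return "".join(code[ch] for ch in text)

def decode(bits, code):
    inverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free, so the first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

Note that the decoder needs the code table as well, so for a single 20-30 digit string the table overhead may outweigh the savings; the approach is more likely to pay off when one table is shared across many strings with skewed digit frequencies.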

+2

You can easily compress a 30-character string to 15 bytes just by using the binary representation of each digit. For example, 1592 can be represented as a series of four-bit values like this:

 0001 0101 1001 0010 

When these are grouped into pairs of four-bit values, each pair fits in one byte, so 1592 is stored as the two bytes 0x15 and 0x92. These are raw byte values rather than printable ASCII text.
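The packing described above could be sketched in Python as follows. The names `pack_digits` and `unpack_digits` are hypothetical, and using the nibble 0xF as filler for odd-length input is an assumption of this sketch, not something the answer specifies.

```python
PAD = 0xF  # filler nibble for odd-length input (an assumption of this sketch)

def pack_digits(digits: str) -> bytes:
    """Pack a string of decimal digits two per byte, high nibble first."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of characters 0-9")
    nibbles = [int(ch) for ch in digits]
    if len(nibbles) % 2:
        nibbles.append(PAD)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

def unpack_digits(data: bytes) -> str:
    """Reverse pack_digits, dropping the filler nibble if present."""
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    if nibbles and nibbles[-1] == PAD:
        nibbles.pop()
    return "".join(str(n) for n in nibbles)
```

With this sketch, `pack_digits("1592")` yields the two bytes 0x15 and 0x92, and any 30-digit string packs into 15 bytes.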

In addition, if your strings contain many identical consecutive digits, you can implement a variation of Run-Length Encoding.
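A minimal run-length sketch in Python, for illustration only. It keeps the runs as (digit, count) pairs and leaves the choice of byte layout for those pairs open; `rle_encode` and `rle_decode` are names invented here.

```python
from itertools import groupby

def rle_encode(digits):
    """Collapse runs of identical digits into (digit, run_length) pairs."""
    return [(d, len(list(group))) for d, group in groupby(digits)]

def rle_decode(pairs):
    """Expand (digit, run_length) pairs back into the original string."""
    return "".join(d * n for d, n in pairs)
```

This only helps when runs are long. On digits without repeats, each digit becomes a pair and the output grows.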

+7

Split it into two 15-digit numbers and convert them to two 64-bit integers? Or are they floats?
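One way this might look in Python, assuming every string is exactly 30 digits long so that leading zeros in either half can be restored on decoding (the fixed length and the function names are assumptions of this sketch):

```python
import struct

def pack_two_u64(digits):
    """Pack a 30-digit string as two unsigned 64-bit integers (16 bytes)."""
    if len(digits) != 30 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly 30 decimal digits")
    high, low = int(digits[:15]), int(digits[15:])
    return struct.pack("<QQ", high, low)  # little-endian, no padding

def unpack_two_u64(data):
    """Rebuild the 30-digit string, zero-padding each half to 15 digits."""
    high, low = struct.unpack("<QQ", data)
    return f"{high:015d}{low:015d}"
```

The result is 16 bytes, one more than packing two digits per byte, but each half can be handled with native integer arithmetic.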

+2

Split it into a handful of unsigned ints?

"9347692367596047327509604839"

becomes:

9 347692367 596047327 509604839
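The split shown above groups nine digits at a time from the right; every nine-digit group is below 2^32, so it fits an unsigned 32-bit integer. A sketch in Python, with hypothetical helper names and under the assumption that the string is non-empty and has no leading zeros (they would be lost in the first group):

```python
import struct

def split_chunks(digits, size=9):
    """Split a digit string into integers of `size` digits, from the right."""
    first = len(digits) % size or size
    parts = [digits[:first]]
    parts += [digits[i:i + size] for i in range(first, len(digits), size)]
    return [int(p) for p in parts]

def pack_chunks(digits):
    """Store each chunk as a little-endian unsigned 32-bit integer."""
    values = split_chunks(digits)
    return struct.pack(f"<{len(values)}I", *values)

def unpack_chunks(data, size=9):
    """Rebuild the string; every chunk after the first is zero-padded."""
    head, *rest = struct.unpack(f"<{len(data) // 4}I", data)
    return str(head) + "".join(f"{v:0{size}d}" for v in rest)
```

For the 28-digit example this gives four integers, or 16 bytes.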

+2

One obvious solution is to "compress" them as a binary numeric representation rather than a string representation. See this Stack Overflow question for an example library.

+1

I would choose the simplest solution and just save them as integers (of the appropriate size, whether 32-bit, 64-bit or 128-bit, depending on your needs). Compression with a character-oriented algorithm wastes space, since it has to cater for many more than the 10 different values (0-9) that each character can actually take.
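In Python, integers are arbitrary precision, so the idea can be tried directly with `int.to_bytes`. The helper names below are invented for this sketch, and the optional `width` parameter is an assumption added to restore leading zeros when the original length is known.

```python
def digits_to_bytes(digits):
    """Store the number in as few big-endian bytes as its value needs."""
    value = int(digits)
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")

def bytes_to_digits(data, width=0):
    """Rebuild the digit string, left-padding with zeros up to `width`."""
    return str(int.from_bytes(data, "big")).zfill(width)
```

A 30-digit number needs at most 100 bits, so it fits in 13 bytes (and in a 128-bit integer), while the largest 20-digit values already exceed an unsigned 64-bit integer.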

+1

One of the most common ways to compress numbers (assuming you have more than one to compress; a single number is hard to compress) is delta coding. It works on this principle: if you know that the first number is x, and the numbers after it are relatively close to it, you can encode subsequent numbers as (x + c1), (x + c2), etc.

In this scheme, you only need to encode the full x value once, and if your c values are much smaller than x, you can save a lot of space. You can also use a variant that sorts the numbers first, so that each delta is taken from the previously seen number instead of from a single base number. With this method, you can handle a wider range of numbers efficiently.
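The sorted variant might be sketched in Python like this. The function names are hypothetical, and the sketch only produces the list of deltas; how the small deltas are then stored (for example with a variable-length integer encoding) is left open.

```python
from itertools import accumulate

def delta_encode(numbers):
    """Sort the values, then keep the first one plus successive differences."""
    ordered = sorted(numbers)
    if not ordered:
        return []
    return [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]

def delta_decode(deltas):
    """Running sums of the deltas give back the sorted values."""
    return list(accumulate(deltas))
```

For instance, the three 23-digit values 10000000000000000000042, 10000000000000000000007 and 10000000000000000000019 encode as the full value 10000000000000000000007 followed by the deltas 12 and 23. Sorting discards the original order, so this variant only suits data where order does not matter.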

+1

Source: https://habr.com/ru/post/1301353/