Numeric String Compression

Question


Can anyone suggest compression algorithms for working with numeric strings of 20-30 digits?

+4

Vijay dev Feb 15 '10 at 16:11

7 answers

You can easily compress 30-character strings to 15 bytes just by storing each digit in its 4-bit binary representation. For example, 1592 can be represented as a series of four-bit values:

 0001 0101 1001 0010

Grouped into pairs of four-bit values, this gives the two bytes 0x15 and 0x92, so every byte carries two digits.
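
The four-bits-per-digit packing described above could be sketched roughly as follows. The function names `pack_bcd` and `unpack_bcd` are made up for illustration, and the use of the nibble 0xF as padding for odd-length strings is an assumption of this sketch, not something the answer specifies.

```python
def pack_bcd(digits: str) -> bytes:
    """Pack a string of decimal digits into bytes, two digits per byte."""
    if not all(ch in "0123456789" for ch in digits):
        raise ValueError("input must contain only the digits 0-9")
    nibbles = [int(ch) for ch in digits]
    # Assumption of this sketch: pad odd-length input with the unused nibble 0xF.
    if len(nibbles) % 2:
        nibbles.append(0xF)
    # The first digit of each pair goes in the high nibble, the second in the low.
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))


def unpack_bcd(data: bytes) -> str:
    """Reverse pack_bcd, dropping any 0xF padding nibble."""
    out = []
    for byte in data:
        for nibble in (byte >> 4, byte & 0x0F):
            if nibble != 0xF:
                out.append(str(nibble))
    return "".join(out)
```

With this sketch, `pack_bcd("1592")` yields the bytes 0x15 and 0x92, and a 30-digit string packs into 15 bytes. Leading zeros survive because each digit is stored individually.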

In addition, if your strings contain many identical consecutive digits, you can implement a variation of Run-Length Encoding.
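
As a rough illustration of the run-length idea, the sketch below turns each run of identical digits into a (digit, count) pair. The names `rle_encode` and `rle_decode` are hypothetical, and how the pairs are then serialized to bytes is left open here.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(digits: str) -> List[Tuple[str, int]]:
    """Collapse each run of identical characters into a (char, run_length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(digits)]


def rle_decode(runs: List[Tuple[str, int]]) -> str:
    """Expand (char, run_length) pairs back into the original string."""
    return "".join(ch * count for ch, count in runs)
```

This only pays off when runs are long; for digits that rarely repeat, the pairs take more space than the input.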

+7

Anton Gogolev Feb 15 '10 at 16:16


Make two 15-digit numbers and convert them to two 64-bit integers? Or are they floating-point values?

+2

extraneon Feb 15 '10 at 16:16


Split it into a few unsigned ints, nine digits at a time?

"9347692367596047327509604839"

becomes:

9 347692367 596047327 509604839
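
A minimal sketch of that split, assuming 9-digit chunks (the largest decimal width that always fits in an unsigned 32-bit integer) and a non-empty input. The helper names and the little-endian `struct` layout are choices made for this example, not part of the answer.

```python
import struct

CHUNK = 9  # any 9-digit value is at most 999,999,999, which fits in 32 bits


def split_chunks(digits: str) -> list:
    """Split from the right so only the first chunk may be shorter than CHUNK."""
    first = len(digits) % CHUNK or CHUNK
    parts = [digits[:first]]
    parts += [digits[i:i + CHUNK] for i in range(first, len(digits), CHUNK)]
    return [int(p) for p in parts]


def pack_chunks(chunks: list) -> bytes:
    """Store each chunk as a little-endian unsigned 32-bit integer."""
    return struct.pack("<%dI" % len(chunks), *chunks)


def join_chunks(chunks: list) -> str:
    """Rebuild the string; every chunk after the first is zero-padded to CHUNK digits."""
    return str(chunks[0]) + "".join("%09d" % c for c in chunks[1:])
```

For the 28-digit example this gives four integers, or 16 bytes once packed. Note that leading zeros in the first chunk are lost unless the original length is stored separately.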

+2

luke Feb 15 '10 at 16:18


One obvious solution is to "compress" them by storing a binary numeric representation rather than a string representation. See this Stack Overflow question for an example library.

+1

David Gladfelter Feb 15 '10 at 16:17


I would choose the simplest solution and just store them as integers (of the appropriate size, whether 32-bit, 64-bit or 128-bit, depending on your needs). A compression algorithm designed for general characters wastes space here, since it has to accommodate far more than 10 distinct values (0-9) per character.
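
In a language with arbitrary-precision integers, storing the value as a plain binary integer might look like the sketch below. The function names are hypothetical, and the `width` parameter is an assumption added so that leading zeros can be restored on decoding.

```python
def digits_to_bytes(digits: str) -> bytes:
    """Store the numeric value in the fewest whole bytes (big-endian)."""
    value = int(digits)
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")


def bytes_to_digits(data: bytes, width: int) -> str:
    """Recover the digit string, left-padding with zeros to the original width."""
    return str(int.from_bytes(data, "big")).zfill(width)
```

A 30-digit value needs at most 100 bits, so it fits in 13 bytes, which is why a 128-bit integer is enough for the sizes in the question.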

+1

Erik A. Brandstadmoen Feb 15 '10 at 16:21


One of the most common ways to compress numbers (assuming you have more than one to compress; a single number on its own is hard to compress) is delta coding. It works on this principle: if you know that the first number is x, and the numbers after it are relatively similar, you can encode subsequent numbers as (x + c1), (x + c2), etc.

In this scheme, you only need to encode the full x value once, and if your c values are much smaller than x, you can save a lot of space. You can also use a version that sorts the numbers first, so that each delta is taken relative to the previously seen number instead of a single base number. With this method, you can handle a wider range of numbers efficiently.
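
The sorted variant could be sketched like this, with Python integers standing in for the parsed numeric strings. `delta_encode` and `delta_decode` are illustrative names, and the sketch assumes the original order of the numbers does not need to be preserved, since sorting discards it.

```python
from itertools import accumulate


def delta_encode(numbers):
    """Sort, then keep the first value and each gap to the previous value."""
    ordered = sorted(numbers)
    if not ordered:
        return []
    return [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]


def delta_decode(deltas):
    """Running sums of the deltas restore the sorted values."""
    return list(accumulate(deltas))
```

The small deltas still have to be written with a variable-length integer encoding (or similar) for the saving to show up on disk.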

+1

twolfe18 Feb 15 '10 at 16:28


Brian R. Bondy · Accepted Answer · 2010-02-15T16:24:50+0000

Assuming the strings can hold floating-point numbers, there are 11 possible characters:

[0,1,2,3,4,5,6,7,8,9,.]

This means that you need 4 bits per character, since 3 bits can distinguish only 8 characters. Packing each character into 4 bits instead of an 8-bit byte already halves the size.

If your string contains only integer digits, a simple solution is to convert the number to hexadecimal. You still use 4 bits per character, but you get a better compression ratio, since all 16 values of each 4-bit group are used.

If you use Huffman coding, you get the optimal bits-per-character ratio among codes that assign a whole number of bits to each character. You can learn more about Huffman compression here.
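
To get a feel for what Huffman coding would buy on digit strings, the sketch below computes only the code length (in bits) that a Huffman tree assigns to each character. It is a simplified illustration under the assumption that character frequencies are taken from the input itself, and `huffman_code_lengths` is a made-up name.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text: str) -> dict:
    """Return {character: code length in bits} for a Huffman code built from text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries are (weight, tiebreak, {symbol: depth}); the unique tiebreak
    # keeps the dicts from ever being compared.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

When all ten digits are equally likely, the result is six 3-bit codes and four 4-bit codes, an average of 3.4 bits per digit. The gain over a flat 4 bits is larger when some digits occur much more often than others.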
