Best Compression Algorithm for an Integer Sequence

I have a large array of integers that are mostly consecutive, e.g. 1-100, 110-160, etc. All integers are positive. What would be the best algorithm to compress this?

I tried the deflate algorithm, but that only gives me 50% compression. Note that the compression must be lossless.

All numbers are unique and gradually increasing.

Also, if you can point me to a java implementation of such an algorithm, that would be great.

+41
algorithm compression
Nov 12 '08 at 8:19
15 answers

We have written recent research papers that survey the best schemes for this problem. Please see:

Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization, Software: Practice and Experience 45 (1), 2015. http://arxiv.org/abs/1209.2137

Daniel Lemire, Nathan Kurz, Leonid Boytsov, SIMD compression and the intersection of sorted integers, Software: Practice and Experience (to appear). http://arxiv.org/abs/1401.6399

These include an extensive experimental evaluation.

You can find complete C++11 implementations of all the methods online: https://github.com/lemire/FastPFor and https://github.com/lemire/SIMDCompressionAndIntersection

There are also C libraries: https://github.com/lemire/simdcomp and https://github.com/lemire/MaskedVByte

If you prefer Java, see https://github.com/lemire/JavaFastPFOR

+52
Feb 12 '13 at 22:28

First, preprocess the list of values by taking the difference between each value and the previous one (for the first value, assume the previous one was zero). In your case this should give mostly a sequence of ones, which can be compressed much more easily by most compression algorithms.

This is how the PNG format improves its compression (it applies one of several differencing methods followed by the same compression algorithm used by gzip).
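As a minimal illustration, the delta transform could be sketched in Java like this (class and method names are my own, not from any library):

```java
import java.util.Arrays;

public class Delta {
    // Replace each value with its difference from the previous value
    // (the value before the first element is taken to be zero).
    public static int[] encode(int[] values) {
        int[] deltas = new int[values.length];
        int prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    // Inverse transform: a running sum restores the original values
    // exactly, so the scheme is lossless.
    public static int[] decode(int[] deltas) {
        int[] values = new int[deltas.length];
        int prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];
            values[i] = prev;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3, 4, 110, 111, 112};
        System.out.println(Arrays.toString(encode(input)));
        // prints [1, 1, 1, 1, 106, 1, 1] - mostly ones
    }
}
```

The long runs of ones in the output are exactly what a general-purpose compressor then exploits.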

+32
Nov 12 '08 at 10:57

Well, I'd vote for the smarter way. All you need to store is [int: start number][int/byte/whatever: run length]; in your example that turns the array into four int values. After that you can compress further as you like :)
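A rough Java sketch of that idea (my own names; it assumes the input is sorted and duplicate-free, as the question states):

```java
import java.util.ArrayList;
import java.util.List;

public class Runs {
    // Collapse a sorted array of unique ints into {start, runLength} pairs.
    public static List<int[]> encode(int[] sorted) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            int start = sorted[i];
            int j = i;
            // extend the run while values stay consecutive
            while (j + 1 < sorted.length && sorted[j + 1] == sorted[j] + 1) {
                j++;
            }
            runs.add(new int[]{start, j - i + 1});
            i = j + 1;
        }
        return runs;
    }
}
```

For the question's example, 1-100 and 110-160 collapse to the pairs (1, 100) and (110, 51) - four ints in total, which you can then compress further.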

+17
Nov 12 '08 at 8:35

Although you could design a custom algorithm specific to your data stream, it is probably easier to use an off-the-shelf encoding algorithm. I ran a few tests of compression algorithms available in Java and found the following compression ratios for a sequence of one million consecutive integers:

 None      1.00
 Deflate   0.50
 Filtered  0.34
 BZip2     0.11
 Lzma      0.06
+11
Jul 04 '09 at 7:52

How big are the numbers? In addition to the other answers, you could consider variable-length base-128 encoding, which lets you store smaller numbers in single bytes while still allowing larger numbers. The MSB means "there is another byte" - described here.

Combine this with the other answers, so that you store "skip size", "take size", "skip size", "take size" - noting that neither "skip" nor "take" will ever be zero, so we subtract one from each (which saves an extra byte for some values).

So:

 1-100, 110-160 

means "skip 1 (assume we start at zero, since that makes things easier), take 100, skip 9, take 51"; subtract 1 from each, giving (in decimal)

 0,99,8,50 

which is encoded as (hex):

 00 63 08 32 

If we wanted to skip/take a larger number - 300, for example - we subtract 1, giving 299; but that exceeds 7 bits, so starting at the little end, we encode blocks of 7 bits with an MSB to indicate continuation:

 299 = 100101011 = (in blocks of 7): 0000010 0101011 

starting at the little end:

 1 0101011 (leading one since continuation) 0 0000010 (leading zero as no more) 

giving:

 AB 02 

Thus we can encode large numbers easily, while small numbers (which sound typical for skip/take) take less space.

You can try to do this through deflate, but that may not help much more ...

If you don't want to deal with all that messy encoding yourself... if you can create an integer array of the values (0,99,8,50), you can use protocol buffers with a packed repeated uint32/uint64 - and it will do all the work for you ;-p

I don't "do" Java, but here is a full C# implementation (borrowing some encoding bits from my protobuf-net project):

 using System;
 using System.Collections.Generic;
 using System.IO;
 using System.Linq;

 static class Program
 {
     static void Main()
     {
         var data = new List<int>();
         data.AddRange(Enumerable.Range(1, 100));
         data.AddRange(Enumerable.Range(110, 51));
         int[] arr = data.ToArray(), arr2;
         using (MemoryStream ms = new MemoryStream())
         {
             Encode(ms, arr);
             ShowRaw(ms.GetBuffer(), (int)ms.Length);
             ms.Position = 0; // rewind to read it...
             arr2 = Decode(ms);
         }
     }
     static void ShowRaw(byte[] buffer, int len)
     {
         for (int i = 0; i < len; i++)
         {
             Console.Write(buffer[i].ToString("X2"));
         }
         Console.WriteLine();
     }
     static int[] Decode(Stream stream)
     {
         var list = new List<int>();
         uint skip, take;
         int last = 0;
         while (TryDecodeUInt32(stream, out skip) && TryDecodeUInt32(stream, out take))
         {
             last += (int)skip + 1;
             for (uint i = 0; i <= take; i++)
             {
                 list.Add(last++);
             }
         }
         return list.ToArray();
     }
     static int Encode(Stream stream, int[] data)
     {
         if (data.Length == 0) return 0;
         byte[] buffer = new byte[10];
         int last = -1, len = 0;
         for (int i = 0; i < data.Length; )
         {
             int gap = data[i] - 2 - last, size = 0;
             while (++i < data.Length && data[i] == data[i - 1] + 1) size++;
             last = data[i - 1];
             len += EncodeUInt32((uint)gap, buffer, stream)
                  + EncodeUInt32((uint)size, buffer, stream);
         }
         return len;
     }
     public static int EncodeUInt32(uint value, byte[] buffer, Stream stream)
     {
         int count = 0, index = 0;
         do
         {
             buffer[index++] = (byte)((value & 0x7F) | 0x80);
             value >>= 7;
             count++;
         } while (value != 0);
         buffer[index - 1] &= 0x7F;
         stream.Write(buffer, 0, count);
         return count;
     }
     public static bool TryDecodeUInt32(Stream source, out uint value)
     {
         int b = source.ReadByte();
         if (b < 0) { value = 0; return false; }
         if ((b & 0x80) == 0) { value = (uint)b; return true; } // single-byte
         int shift = 7;
         value = (uint)(b & 0x7F);
         bool keepGoing;
         int i = 0;
         do
         {
             b = source.ReadByte();
             if (b < 0) throw new EndOfStreamException();
             i++;
             keepGoing = (b & 0x80) != 0;
             value |= ((uint)(b & 0x7F)) << shift;
             shift += 7;
         } while (keepGoing && i < 4);
         if (keepGoing && i == 4) { throw new OverflowException(); }
         return true;
     }
 }
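Since the question asked for Java: here is a rough Java port of just the base-128 varint primitive, kept byte-array based so it is self-contained (my own sketch, not taken from protobuf-net or any Java library):

```java
import java.io.ByteArrayOutputStream;

public class VarInt {
    // Encode a non-negative int as little-endian base-128: low 7 bits per
    // byte, MSB set on every byte except the last ("another byte follows").
    public static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode a single varint starting at the given offset.
    public static int decode(byte[] data, int offset) {
        int value = 0, shift = 0, b;
        do {
            b = data[offset++] & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

Values below 128 take a single byte; encode(300) produces the two bytes AC 02.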
+11
Jul 04 '09 at 8:06

Compress the string "1-100, 110-160", or store the string in some binary representation and parse it to restore the array.

+3
Nov 12 '08 at 8:31

In addition to other solutions:

You can find "dense" areas and use a bitmap to store them.

So for example:

If you have 1000 numbers falling in 400 ranges between 1000 and 3000, you can use a single bit per candidate value to indicate whether that number is present, plus two values to delimit the region. Total storage for the region is 2000 bits + 2 bounds, so you can store this information in 254 bytes - quite good, since even short integers would take two bytes each, so for this example you get an 87 percent saving.

The denser the region, the better this algorithm does, but at some point just storing the start and end of each range will be cheaper.
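A sketch of the dense-region idea in Java using java.util.BitSet (an illustration of my own, not a library API):

```java
import java.util.BitSet;

public class RangeBitmap {
    private final int lo;      // inclusive lower bound of the dense region
    private final BitSet bits; // bit i is set iff value (lo + i) is present

    public RangeBitmap(int lo, int hi, int[] values) {
        this.lo = lo;
        this.bits = new BitSet(hi - lo + 1);
        for (int v : values) {
            if (v >= lo && v <= hi) {
                bits.set(v - lo);
            }
        }
    }

    // Membership test; values outside the region report absent.
    public boolean contains(int v) {
        return v >= lo && bits.get(v - lo);
    }
}
```

For the 1000-3000 example above, the bitmap covers 2001 candidate values, i.e. roughly 250 bytes plus the two bounds - in the same ballpark as the ~254 bytes computed in the answer.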

+3
Jul 04 '09 at 8:34

I would combine the answers given by CesarB and Fernando Miguélez.

First store the differences between each value and the previous one. As CesarB noted, this will give you a sequence consisting mostly of ones.

Then apply a run-length encoding (RLE) compression algorithm to that sequence. It will compress very well because of the large number of repeated values.
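A possible Java sketch of this combination - delta-encode, then run-length-encode the deltas as (delta, count) pairs (names and layout are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class DeltaRle {
    // Delta-encode the input, then collapse repeated deltas
    // into {delta, count} pairs.
    public static List<int[]> compress(int[] values) {
        List<int[]> runs = new ArrayList<>();
        int prev = 0;      // value before the first element is zero
        int runDelta = 0;
        int runCount = 0;
        for (int v : values) {
            int d = v - prev;
            prev = v;
            if (runCount > 0 && d == runDelta) {
                runCount++;
            } else {
                if (runCount > 0) {
                    runs.add(new int[]{runDelta, runCount});
                }
                runDelta = d;
                runCount = 1;
            }
        }
        if (runCount > 0) {
            runs.add(new int[]{runDelta, runCount});
        }
        return runs;
    }
}
```

For 1-100 followed by 110-160 this yields just three pairs: (1, 100), (10, 1), (1, 51).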

+2
Nov 12 '08 at 11:34

I would suggest taking a look at Huffman Coding , a special case of Arithmetic Coding . In both cases, you analyze your initial sequence to determine the relative frequencies of different values. More frequently encountered values ​​are encoded with fewer bits than less frequently encountered values.

+1
Nov 12 '08 at 13:08

The basic idea you should probably use is, for each run of consecutive integers (I will call these ranges), to store the starting number and the size of the range. For example, if you have a list of 1000 integers but only 10 separate ranges, you can store a mere 20 integers (1 start number and 1 size for each range) to represent the data - a compression ratio of 98%. Fortunately, there are a few more optimizations you can make that help in cases where the number of ranges is larger.

  • Store the offset of the start number relative to the previous start number, rather than the start number itself. The advantage is that the numbers you store generally need fewer bits (this comes in handy in the next optimization). Also, if you stored only the start numbers, they would all be unique, while storing offsets gives a chance that the numbers are close or even repeated, which may allow still more compression by another method applied afterwards.

  • Use the minimum number of bits for both types of integers. You can iterate over the numbers to find the largest start-number offset as well as the largest range size. Then you can use the data type that stores those integers most efficiently, and simply record that data type, or number of bits, at the beginning of the compressed data. For example, if the largest start-number offset is only 12,000 and the largest range size is 9,000, you can use a 2-byte unsigned integer for everything, and place the pair (2, 2) at the start of the compressed data to show that 2 bytes are used for both kinds of integer. Of course you can fit this information into a single byte with a little bit manipulation. If you are comfortable doing heavy bit manipulation, you could store each number in the minimum possible number of bits, rather than rounding up to 1-, 2-, 4-, or 8-byte representations.

Using these two optimizations, consider a few examples (each is 4000 bytes):

  • 1000 integers, the largest offset - 500, 10 ranges
  • 1000 integers, the largest offset - 100, 50 ranges
  • 1000 integers, the largest offset - 50, 100 ranges

WITHOUT OPTIMIZATIONS

  • 20 integers, 4 bytes each = 80 bytes. COMPRESSION = 98%
  • 100 integers, 4 bytes each = 400 bytes. COMPRESSION = 90%
  • 200 integers, 4 bytes each = 800 bytes. COMPRESSION = 80%

WITH OPTIMIZATIONS

  • 1-byte header + 20 numbers at 1 byte each = 21 bytes. COMPRESSION = 99.475%
  • 1-byte header + 100 numbers at 1 byte each = 101 bytes. COMPRESSION = 97.475%
  • 1-byte header + 200 numbers at 1 byte each = 201 bytes. COMPRESSION = 94.975%
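To make the second optimization concrete, here is a small hypothetical Java helper that picks the minimal byte width and computes the resulting size (my own sketch; it assumes the 1-, 2-, 4-, 8-byte widths described above):

```java
public class RangeStats {
    // Smallest unsigned width (1, 2, 4, or 8 bytes) that can hold max.
    public static int byteWidth(long max) {
        if (max < (1L << 8)) return 1;
        if (max < (1L << 16)) return 2;
        if (max < (1L << 32)) return 4;
        return 8;
    }

    // Total size: 1 header byte describing the widths, plus one
    // (offset, size) pair per range at the chosen widths.
    public static int compressedSize(int numRanges, long maxOffset, long maxRangeSize) {
        return 1 + numRanges * (byteWidth(maxOffset) + byteWidth(maxRangeSize));
    }
}
```

With the 12,000 / 9,000 example above, both fields fit in 2 bytes, so 10 ranges would take 1 + 10 * 4 = 41 bytes.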
+1
Jul 04 '09 at 8:49

Your case is very similar to index compression in search engines. Popular compression algorithms used there are PForDelta and Simple16. You can use the kamikaze library for your compression needs.

+1
Jun 23 '11 at 16:28

If you have series of repeated values, RLE is the easiest to implement and can give you a good result. Nevertheless, other, more advanced algorithms that take entropy into account, such as LZW (which is now patent-free), can usually achieve much better compression.

You can see these and other lossless algorithms here .

0
Nov 12 '08 at 8:26

I know this is an old thread, but I'm including my personal PHP test of the SKIP/TAKE idea I found here. I call mine STEP(+)/SPAN(-). Perhaps it will help someone.

NOTE: I implemented the ability to allow duplicate integers as well as negative integers, even though the original question was about positive, non-duplicated integers. Feel free to tweak it if you want to try to shave a byte or two.

CODE:

 // $integers_array can contain any integers; no floating point, please. Duplicates okay.
 $integers_array = [118, 68, -9, 82, 67, -36, 15, 27, 26, 138, 45, 121, 72, 63, 73,
     -35, 68, 46, 37, -28, -12, 42, 101, 21, 35, 100, 44, 13, 125, 142, 36, 88, 113,
     -40, 40, -25, 116, -21, 123, -10, 43, 130, 7, 39, 69, 102, 24, 75, 64, 127, 109,
     38, 41, -23, 21, -21, 101, 138, 51, 4, 93, -29, -13];

 // Order from least to greatest... This routine does NOT save original order of integers.
 sort($integers_array, SORT_NUMERIC);

 // Start with the least value... NOTE: This removes the first value from the array.
 $start = $current = array_shift($integers_array);

 // This caps the end of the array, so we can easily get the last step or span value.
 array_push($integers_array, $start - 1);

 // Create the compressed array...
 $compressed_array = [$start];
 foreach ($integers_array as $next_value) {
     // Range of $current to $next_value is our "skip" range. I call it a "step" instead.
     $step = $next_value - $current;
     if ($step == 1) {
         // Took a single step, wait to find the end of a series of sequential numbers.
         $current = $next_value;
     } else {
         // Range of $start to $current is our "take" range. I call it a "span" instead.
         $span = $current - $start;
         // If $span is positive, use "negative" to identify these as sequential numbers.
         if ($span > 0) array_push($compressed_array, -$span);
         // If $step is positive, move forward. If $step is zero, the number is duplicate.
         if ($step >= 0) array_push($compressed_array, $step);
         // In any case, we are resetting our start of potentially sequential numbers.
         $start = $current = $next_value;
     }
 }

 // OPTIONAL: The following code attempts to compress things further in a variety of ways.

 // A quick check to see what pack size we can use.
 $largest_integer = max(max($compressed_array), -min($compressed_array));
 if ($largest_integer < pow(2,7)) $pack_size = 'c';
 elseif ($largest_integer < pow(2,15)) $pack_size = 's';
 elseif ($largest_integer < pow(2,31)) $pack_size = 'l';
 elseif ($largest_integer < pow(2,63)) $pack_size = 'q';
 else die('Too freaking large, try something else!');

 // NOTE: I did not implement the MSB feature mentioned by Marc Gravell.
 // I'm just pre-pending the $pack_size as the first byte, so I know how to unpack it.
 $packed_string = $pack_size;

 // Save compressed array to compressed string and binary packed string.
 $compressed_string = '';
 foreach ($compressed_array as $value) {
     $compressed_string .= ($value < 0) ? $value : '+'.$value;
     $packed_string .= pack($pack_size, $value);
 }

 // We can possibly compress it more with gzip if there are lots of similar values.
 $gz_string = gzcompress($packed_string);

 // These were all just size tests I left in for you.
 $base64_string = base64_encode($packed_string);
 $gz64_string = base64_encode($gz_string);
 $compressed_string = trim($compressed_string,'+'); // Don't need leading '+'.

 echo "<hr>\nOriginal Array has ".count($integers_array).' elements: {not showing, since I modified the original array directly}';
 echo "<br>\nCompressed Array has ".count($compressed_array).' elements: '.implode(', ',$compressed_array);
 echo "<br>\nCompressed String has ".strlen($compressed_string).' characters: '.$compressed_string;
 echo "<br>\nPacked String has ".strlen($packed_string).' (some probably not printable) characters: '.$packed_string;
 echo "<br>\nBase64 String has ".strlen($base64_string).' (all printable) characters: '.$base64_string;
 echo "<br>\nGZipped String has ".strlen($gz_string).' (some probably not printable) characters: '.$gz_string;
 echo "<br>\nBase64 of GZipped String has ".strlen($gz64_string).' (all printable) characters: '.$gz64_string;

 // NOTICE: The following code reverses the process, starting from the $compressed_array.

 // The first value is always the starting value.
 $current_value = array_shift($compressed_array);
 $uncompressed_array = [$current_value];
 foreach ($compressed_array as $val) {
     if ($val < -1) {
         // For ranges that span more than two values, we have to fill in the values.
         $range = range($current_value + 1, $current_value - $val - 1);
         $uncompressed_array = array_merge($uncompressed_array, $range);
     }
     // Add the step value to the $current_value
     $current_value += abs($val);
     // Add the newly-determined $current_value to our list. If $val==0, it is a repeat!
     array_push($uncompressed_array, $current_value);
 }

 // Display the uncompressed array.
 echo "<hr>Reconstituted Array has ".count($uncompressed_array).' elements: '.implode(', ',$uncompressed_array).'<hr>';

OUTPUT:

 --------------------------------------------------------------------------------
 Original Array has 63 elements: {not showing, since I modified the original array directly}
 Compressed Array has 53 elements: -40, 4, -1, 6, -1, 3, 2, 2, 0, 8, -1, 2, -1, 13, 3, 6, 2, 6, 0, 3, 2, -1, 8, -11, 5, 12, -1, 3, -1, 0, -1, 3, -1, 2, 7, 6, 5, 7, -1, 0, -1, 7, 4, 3, 2, 3, 2, 2, 2, 3, 8, 0, 4
 Compressed String has 110 characters: -40+4-1+6-1+3+2+2+0+8-1+2-1+13+3+6+2+6+0+3+2-1+8-11+5+12-1+3-1+0-1+3-1+2+7+6+5+7-1+0-1+7+4+3+2+3+2+2+2+3+8+0+4
 Packed String has 54 (some probably not printable) characters: cØÿÿÿÿ ÿõ ÿÿÿÿÿÿ
 Base64 String has 72 (all printable) characters: Y9gE/wb/AwICAAj/Av8NAwYCBgADAv8I9QUM/wP/AP8D/wIHBgUH/wD/BwQDAgMCAgIDCAAE
 GZipped String has 53 (some probably not printable) characters: xœ Ê» ÑÈί€)YšE¨MŠ"^qçºR¬m&Òõ‹%Ê&TFʉùÀ6ÿÁÁ Æ
 Base64 of GZipped String has 72 (all printable) characters: eJwNyrsNACAMA9HIzq+AKVmaRahNipNecee6UgSsBW0m0gj1iyXKJlRGjcqJ+cA2/8HBDcY=
 --------------------------------------------------------------------------------
 Reconstituted Array has 63 elements: -40, -36, -35, -29, -28, -25, -23, -21, -21, -13, -12, -10, -9, 4, 7, 13, 15, 21, 21, 24, 26, 27, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 51, 63, 64, 67, 68, 68, 69, 72, 73, 75, 82, 88, 93, 100, 101, 101, 102, 109, 113, 116, 118, 121, 123, 125, 127, 130, 138, 138, 142
 --------------------------------------------------------------------------------
0
May 04 '16 at 21:11

TurboPFor: fast compression of integers

  • for C / C ++, including Java Critical Natives / JNI Interface
  • SIMD accelerated integer compression
  • Scalar + integrated (SIMD) differential/zigzag encoding/decoding for sorted and unsorted integer lists
  • Full range of 8/16/32/64-bit integer lists
  • Direct access
  • Benchmark app
0
Jul 08 '16 at 16:07

I couldn't get my compression much better than about 0.11 for this. I generated my test data via the Python interpreter: it is a newline-separated list of the integers 1-100 and 110-160. I use an actual program as the compressed representation of the data. My compressed file is:

 main=mapM_ print [x|x<-[1..160],x`notElem`[101..109]] 

This is just a Haskell script that generates the data file, which you can run via:

 $ runhaskell generator.hs >> data 

The generator.hs file is 54 bytes, and the Python-generated data is 496 bytes. This gives 0.10887096774193548 as the compression ratio. I think with more time one could shrink the program, or compress the compressed file (i.e., the Haskell file).

Another approach might be to store 4 bytes of data - the minimum and maximum of each sequence - and then feed those to a generating function. Loading files will then add more characters to the decompressor, though, adding complexity and more bytes to it. Again, I represented this very specific sequence with a program and did not generalize it; this is compression specific to the data. Moreover, adding generality makes the decompressor larger.

Another problem is that you need a Haskell interpreter for this. When I compiled the program, it became much larger. I don't really know why. There is the same problem with Python, so maybe the best approach is to store the ranges, so that some program could decompress the file.

0
Nov 08 '17 at 21:02
source share
