What is the best way to compress a list of similar but not identical lines?

Say I have a series of lines that are very similar, but absolutely not identical.

They may differ more or less, but the similarities can be seen with the naked eye.

All lines have equal length, 256 bytes each. The total number of lines is less than 2^16.

What would be the best compression method for this case?

UPDATE (data format):

I can't share the data, but I can describe something very close to the real thing:

Imagine a notation (for example, the LOGO language) that is a sequence of commands for a device that moves and draws on a plane. For instance:

U12 - move up 12 steps
D64 - move down 64 steps
C78 - change drawing color to 78
P1 - pen down (start drawing)

etc.

The entire vocabulary of this language is no larger than the English alphabet.

A single line then describes a whole image: "U12C6P1L74D74R74U74P0....".

Imagine a class of ten thousand children who were told to draw a very specific image in this language, say the flag of their country. We get 10K lines that are all different and all the same at the same time.

Our task is to compress the whole set of lines as much as possible.

My suspicion is that there should be a way to exploit this similarity between lines and their fixed length, whereas Huffman coding, for example, does not use it explicitly.
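As a quick illustration of that suspicion (with made-up data, not the asker's real lines): a general-purpose LZ-based compressor already exploits cross-line repetition when it sees all lines as one stream, because its back-references can reach into previous lines, whereas compressing each 256-byte line independently cannot.

```python
import zlib

# Hypothetical data: 100 similar 256-byte lines that differ in one position each.
base = bytearray(b"U12C6P1L74D74R74U74P0".ljust(256, b"."))
lines = []
for i in range(100):
    row = bytearray(base)
    row[i % 256] = ord("X")  # small per-line variation
    lines.append(bytes(row))

# Compressing each line separately: no cross-line similarity can be used.
per_line = sum(len(zlib.compress(line, 9)) for line in lines)

# Compressing the concatenated stream: LZ77 matches span previous lines.
whole = len(zlib.compress(b"".join(lines), 9))

print(per_line, whole)  # the whole-stream size is far smaller
```

This only demonstrates the principle; the real data may behave differently, and dedicated methods (delta coding against a reference line, PPM, etc.) can do better still.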

3 answers

Could you tell us what the data is? Maybe something like a DNA sequence, e.g.:

AGCTGTGCGAGAGAGAGCGGTGGG ...

GGCTGTGCGAGCGAGAGCGGTGGG ...

CGCTGTGAGAGNGAGAGCGGTGGG ...

NGCTGTGCGAGAGAGAGCGGTGGG ...

GGCTGTGCGAGTGAGAGCGGTGGG ...

......

Maybe not. In any case, there are two levels, or two ways, to think about it.

I think your problem is easy to solve, but it is hard to choose the best way. You could build several candidate methods and compare them, starting from http://en.wikipedia.org/wiki/Data_compression and other resources.


Since you have a fixed record width of 256 bytes, which is a power of two, I would try a Burrows-Wheeler transform (BWT) or a move-to-front (MTF) algorithm with that block size, or possibly double that size. Then you can apply Huffman coding on top. You could also try reordering the 256 bytes along a Hilbert curve first, then BWT and MTF.
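For readers unfamiliar with these transforms, here is a minimal, naive sketch of BWT and MTF (neither is tuned for the asker's data; real implementations use suffix arrays instead of sorting rotations):

```python
def bwt(s: str, sentinel: str = "\0") -> str:
    # Naive Burrows-Wheeler transform: append a sentinel smaller than any
    # symbol, sort all rotations, and take the last column. Groups similar
    # contexts together so that MTF/Huffman afterwards compress better.
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def mtf(data: str, alphabet: str) -> list[int]:
    # Move-to-front: emit each symbol's current position in the table,
    # then move that symbol to the front. Runs of repeated symbols
    # become runs of zeros, which entropy coders handle well.
    table = list(alphabet)
    out = []
    for ch in data:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out
```

For example, `bwt("banana")` yields `"annb\0aa"`, clustering the repeated characters.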


"The total number of rows is less than 2 ^ 16." This is a small limited number, which makes your work very simple: why don't you save the lookup table (hash table) of all the rows that you saw before. You can then convert each row of 256 bytes to a double-byte index into this lookup table.

Then you have a sequence of 16-bit integers. These integers will contain patterns such as "after the pen goes down, there is a 90% chance the next command starts drawing". If the data contains such patterns, PPM is a good choice; 7-Zip has a high-quality PPMd implementation, selectable from the GUI or the command line.
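A minimal sketch of the lookup-table step described above (Python; the sample lines are hypothetical). The resulting dictionary plus the packed 16-bit index stream can then be handed to a general-purpose compressor such as 7-Zip's PPMd:

```python
import struct

def encode_lines(lines: list[bytes]) -> tuple[list[bytes], bytes]:
    """Replace each distinct line by a 16-bit index into a lookup table.

    Assumes fewer than 2**16 distinct lines, as stated in the question.
    """
    table: dict[bytes, int] = {}
    indices = []
    for line in lines:
        # setdefault assigns the next free index on first sight of a line.
        indices.append(table.setdefault(line, len(table)))
    dictionary = list(table)  # index -> line, insertion-ordered
    packed = struct.pack(f"<{len(indices)}H", *indices)
    return dictionary, packed
```

Note that indexing alone only helps if many lines repeat exactly; if all 10K lines are merely similar, the dictionary is as big as the input, and a context-modeling compressor over the raw lines is the better bet.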


Source: https://habr.com/ru/post/1400812/
