Say I have a series of lines that are very similar, but absolutely not identical.
They may differ to varying degrees, but the similarity is visible to the naked eye.
All lines are the same length, 256 bytes each. The total number of lines is less than 2^16.
What would be the best compression method for this case?
UPDATE (data format):
I can't share the data, but I can describe it pretty close to reality:
Imagine a notation (for example, the LOGO language) that is a sequence of commands for a device that moves and draws on a plane. For instance:
U12 - move up 12 steps
D64 - move down 64 steps
C78 - change drawing color to 78
P1 - pen down (start drawing)
etc.
The entire vocabulary of this language does not exceed the size of the English alphabet.
A single line then describes a whole image: "U12C6P1L74D74R74U74P0...".
Imagine a class of ten thousand children who were told to use this language to draw one very specific image, say, the flag of their country. We get 10K lines that are all different and yet all the same.
Our task is to compress the whole set of lines as much as possible.
My suspicion is that there is a way to exploit this similarity and the fixed length of the lines, whereas Huffman coding, for example, does not use them explicitly.
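To make that suspicion concrete, here is a minimal sketch of the kind of thing I have in mind (assuming Python and its standard zlib module; the sample lines below are fabricated just for illustration, not my real data): delta-code each line against the previous one so the shared structure becomes runs of zeros, then hand the result to a generic compressor.

```python
import random
import zlib

LINE_LEN = 256

# Fabricated stand-in for the real data: a common template with a few
# randomly perturbed bytes per line ("all different and yet all the same").
template = bytes(random.randrange(32, 127) for _ in range(LINE_LEN))
lines = []
for _ in range(1000):
    line = bytearray(template)
    for _ in range(8):  # a handful of local differences per line
        line[random.randrange(LINE_LEN)] = random.randrange(32, 127)
    lines.append(bytes(line))

raw = b"".join(lines)

# Delta against the previous line: positions that match become 0x00,
# which a generic compressor like DEFLATE then squeezes very well.
prev = bytes(LINE_LEN)
delta_parts = []
for line in lines:
    delta_parts.append(bytes(a ^ b for a, b in zip(line, prev)))
    prev = line
delta = b"".join(delta_parts)

print("plain zlib :", len(zlib.compress(raw, 9)))
print("delta+zlib :", len(zlib.compress(delta, 9)))
```

Is something along these lines (or a known algorithm that does this better) the right approach here?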