Reorder lines in a text file for better compression

I have many huge text files that need to be compressed at the maximum ratio. Compression may be slow, as long as decompression is reasonably fast.

Each line in these files contains one data set, and they can be saved in any order.

A similar problem to this one: Sorting a file to optimize compression efficiency

But for me, compression speed is not a problem. Are there any ready-made tools for grouping similar strings? Or maybe just an algorithm that I can implement?

Sorting itself gave some improvement, but I suspect that much more is possible.

Each file is about 600 million lines of ~40 bytes each, 24 GB in total. They compress to ~10 GB with xz.

1 answer

Here's a pretty naive algorithm:

  • Pick an arbitrary starting line and write it to the compression stream.
  • While the number of remaining lines is > 0:
    • Save the state of the compression stream.
    • For each remaining line in the text file:
      • write the line to the compression stream and record the resulting compressed length
      • restore the saved state of the compression stream
    • Write the line that produced the smallest compressed length to the compression stream.
    • Free the saved state.

This is a greedy algorithm, so it will not find a globally optimal ordering, but it should tend to place similar lines next to each other. It is O(n²) in the number of lines, which is very slow for files of this size; the upside is that only compression, not decompression, pays that cost.

With zlib, deflateCopy can be used to save and restore the compressor's state, which is exactly what the save/restore steps above require.
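The steps above can be sketched in Python, where `zlib.compressobj().copy()` plays the role of `deflateCopy`. This is a minimal sketch of my own, not the answerer's code; the function name and the use of a sync flush to measure incremental output are my assumptions:

```python
import zlib

def greedy_order(lines):
    """Greedily reorder lines so each next line compresses best
    given the stream so far. O(n^2) compression calls."""
    remaining = list(lines)
    ordered = [remaining.pop(0)]        # arbitrary starting line
    comp = zlib.compressobj(9)
    comp.compress(ordered[0])
    while remaining:
        best_i, best_len = 0, None
        for i, line in enumerate(remaining):
            trial = comp.copy()         # snapshot the stream state
            # Sync-flush so the candidate's bytes are actually emitted
            n = len(trial.compress(line)) + len(trial.flush(zlib.Z_SYNC_FLUSH))
            if best_len is None or n < best_len:
                best_i, best_len = i, n
        winner = remaining.pop(best_i)
        comp.compress(winner)           # commit the winner to the stream
        ordered.append(winner)
    return ordered
```

The snapshot-and-discard pattern avoids recompressing the whole prefix for every candidate, but the inner loop still makes this quadratic, so it is only practical on chunks, not on 600 million lines at once.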

Note: in effect you are searching for an ordering that minimizes the "distance" between consecutive lines, where distance is how well one line compresses after the other. That makes this a travelling salesman problem (TSP), so existing TSP heuristics and approximations may apply.
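As a rough illustration of the TSP framing, here is a nearest-neighbour heuristic. The cost function (extra compressed bytes contributed by the next line) is my own proxy for "distance", not something specified in the answer:

```python
import zlib

def pair_cost(a, b):
    # Extra compressed bytes that b adds after a; smaller = more similar.
    return len(zlib.compress(a + b, 9)) - len(zlib.compress(a, 9))

def nearest_neighbor_order(lines):
    """Greedy TSP heuristic: always move to the 'closest' remaining line."""
    remaining = list(lines)
    path = [remaining.pop(0)]
    while remaining:
        cur = path[-1]
        i = min(range(len(remaining)),
                key=lambda j: pair_cost(cur, remaining[j]))
        path.append(remaining.pop(i))
    return path
```

This is still O(n²) cost evaluations, but each evaluation compresses only a pair of short lines rather than the whole accumulated stream, so it is much cheaper per step than the full greedy algorithm above.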


Source: https://habr.com/ru/post/1680306/
