We have several large data files that are concatenated, compressed, and then sent to another server. Compression reduces the transfer time to the destination server, so the smaller we can make the file in a short amount of time, the better. This is a very time-sensitive process.
The data files contain many lines of tab-delimited text, and the order of the lines does not matter.
We noticed that when we sorted the file by the first field, the compressed file was much smaller, apparently because duplicate values in that column end up next to each other. However, sorting a large file is slow, and there is no real reason for the file to be sorted other than the improved compression. There is also no relationship between the first column and the subsequent columns. There may be some ordering of the lines that compresses even better, or, alternatively, there may be an algorithm that achieves a similar compression gain but takes less time to run.
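For reference, a minimal sketch of the effect we observe, assuming gzip and a file small enough to sort in memory (our real files are larger, and `data.tsv` is a hypothetical file name):

```python
import gzip

# Read the tab-delimited file as raw bytes (hypothetical file name).
with open("data.tsv", "rb") as f:
    lines = f.readlines()

# Compress as-is, then again after sorting by the first tab-delimited field.
unsorted_size = len(gzip.compress(b"".join(lines)))
lines.sort(key=lambda line: line.split(b"\t", 1)[0])
sorted_size = len(gzip.compress(b"".join(lines)))

print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```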
What approach can be used to arrange the lines so that adjacent lines are as similar as possible, improving compression performance?