We have several large data files that are concatenated, compressed, and then sent to another server. Compression reduces the transfer time to the destination server, so the smaller we can make the file in a short amount of time, the better. This is a very time-sensitive process.
The data files contain many lines of tab-delimited text, and the order of the lines does not matter.
We noticed that when we sorted the file by the first field, the compressed file was much smaller, apparently because duplicate values in that column end up next to each other. However, sorting a large file is slow, and there is no real reason for the file to be sorted other than the improved compression. There is also no relationship between the first column and the subsequent columns. There may be some ordering of the lines that compresses even better, or, alternatively, there may be an algorithm that achieves a similar compression gain but takes less time to run.
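For reference, a minimal sketch of the effect we observe, assuming gzip and a file small enough to sort in memory (our real files are larger, and `data.tsv` is a hypothetical file name):

```python
import gzip

# Read the tab-delimited file as raw bytes (hypothetical file name).
with open("data.tsv", "rb") as f:
    lines = f.readlines()

# Compress as-is, then again after sorting by the first tab-delimited field.
unsorted_size = len(gzip.compress(b"".join(lines)))
lines.sort(key=lambda line: line.split(b"\t", 1)[0])
sorted_size = len(gzip.compress(b"".join(lines)))

print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```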
What approach can be used to arrange the lines so that adjacent lines are as similar as possible, improving compression performance?