How to efficiently sort a large file on two levels?

I have a very large file, more than 100 GB (many billions of lines), and I would like to carry out a two-level sort as quickly as possible on a Unix system with limited memory. This will be one step in a large Perl script, so I would like to use Perl if possible.

So how can I do this? My data is as follows:

A 129
B 192
A 388
D 148
D 911
A 117

... but for billions of lines. I need to sort by letter first and then by number. Would it be easiest to use Unix sort, for example:

 sort -k1,2 myfile 

Or can I do it all in Perl? My system will have approximately 16 GB of memory, but the file is about 100 GB.

Thanks for any suggestions!

+6
3 answers

The UNIX sort utility can handle sorting of large data (for example, larger than your working memory of 16 GB), creating temporary working files on disk.

So I would recommend just using UNIX sort for this, as you suggested, passing the -T tmp_dir option and making sure tmp_dir has enough disk space to hold all the temporary working files that will be created there.
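As a rough sketch of that advice, the commands below run sort with an explicit temporary directory on a tiny stand-in file built from the sample rows in the question. The paths (work, tmp_dir, myfile.sorted) are hypothetical, and the key options -k1,1 -k2,2n are my assumption about what "letter first, then number" should mean; they are not part of this answer.

```shell
# Stand-in for the 100 GB file: the sample rows from the question.
work=$(mktemp -d)
printf '%s\n' 'A 129' 'B 192' 'A 388' 'D 148' 'D 911' 'A 117' > "$work/myfile"

# Hypothetical scratch directory; in practice, place it on a disk with
# enough free space for the temporary files.
tmp_dir="$work/sort_tmp"
mkdir -p "$tmp_dir"

# -T: directory where sort writes its temporary files
# -k1,1: first key is field 1 only; -k2,2n: second key is field 2, numeric
# -o: write the sorted result to a file
LC_ALL=C sort -T "$tmp_dir" -k1,1 -k2,2n -o "$work/myfile.sorted" "$work/myfile"

cat "$work/myfile.sorted"
```

From the Perl script, a command like this could be run with Perl's built-in system function, so the sort remains a single step in the larger script.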

By the way, this is discussed in the previous SO question.

+8

UNIX sort is the best option for sorting data of this scale. I would recommend compressing the temporary files with the fast LZO algorithm, which is usually distributed as lzop. Set a large sort buffer using the -S option. If you have a disk that is faster than the one holding the default /tmp, also point -T at it. In addition, if you want the second field sorted by number, you must specify a numeric sort for the second sort field. So for best performance, you should use this line:

 LC_ALL=C sort -S 90% --compress-program=lzop -k1,1 -k2n 
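To illustrate why the key definition matters, here is a small sketch comparing the question's -k1,2 with separate keys on made-up sample rows. The sample data, the work directory, and the check for whether lzop is installed are my assumptions; I also write -k2,2n rather than -k2n so the numeric key is limited to the second field.

```shell
# Made-up sample where lexical and numeric order differ.
work=$(mktemp -d)
printf '%s\n' 'B 5' 'A 10' 'A 9' 'A 100' > "$work/sample"

# -k1,2 treats fields 1 through 2 as one string key, so "A 10" and
# "A 100" sort before "A 9".
lexical=$(LC_ALL=C sort -k1,2 "$work/sample")
echo "$lexical"

# Collect the tuning options: a large buffer and an explicit temp directory.
set -- -S 90% -T "$work"
# Assumption: only ask for lzop compression when the program is installed.
if command -v lzop >/dev/null 2>&1; then
  set -- "$@" --compress-program=lzop
fi

# Letter first (field 1 only), then field 2 compared as a number.
numeric=$(LC_ALL=C sort "$@" -k1,1 -k2,2n "$work/sample")
echo "$numeric"
```

The first listing comes out as A 10, A 100, A 9, B 5 and the second as A 9, A 10, A 100, B 5. On input this small, sort never spills to disk, so the compression program is not actually exercised.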
+1

I had the same problem! After a lot of searching, and since I did not want any dependency on a (UNIX) shell so that the script stays portable to Windows, I came up with the following solution:

 #!/usr/bin/perl
 use File::Sort qw(sort_file);

 my $src_dic_name = 'C:\STORAGE\PERSONAL\PROJECTS\perl\test.txt';
 sort_file({k => 1, t => " ", I => $src_dic_name, o => $src_dic_name.".sorted"});

I know this is an old post, but updating it with a solution so that it is easy to find.

See the File::Sort documentation on CPAN.

0

Source: https://habr.com/ru/post/951575/

