How to combine lines in a large, unsorted file without running out of memory in Perl?

I have a very large comma-delimited file coming out of a database report, something like this:

field1,field2,field3,metricA,value1
field1,field2,field3,metricB,value2

I want the new file to have the lines combined, so it looks something like this:

field1,field2,field3,value1,value2

I can do this using a hash. In this example, the first three fields are the key, and I combine value1 and value2 in a specific order as the value. After I have read in the file, I just print the hash keys and values out to another file. It works great.

However, I have some concerns, as my file is going to be very large: about 8 GB per file.

Would there be a more efficient way to do this? I am not thinking in terms of speed, but in terms of memory footprint. I am concerned that this process could die due to memory problems. I am drawing a blank on a solution that works without ultimately cramming everything into one very large hash.

For full disclosure, I use ActiveState Perl on Windows.

+3
5 answers

If your rows are sorted by key, or for some other reason equal values of field1, field2, field3 are adjacent, then a state machine will be much faster. Just read over the lines, and if the fields match those of the previous line, emit both values.
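A minimal sketch of that state machine, using only core Perl. It assumes the key is the first three fields, that lines sharing a key are adjacent with metricA ahead of metricB, and that no field contains an embedded comma. The name combine_adjacent is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: reads lines from $in and writes one combined line per
# key to $out. Only the current group is held in memory, never the whole file.
sub combine_adjacent {
    my ($in, $out) = @_;
    my ($prev_key, @values);

    while (my $line = <$in>) {
        chomp $line;
        my @columns = split /,/, $line;
        my $key = join ',', @columns[0 .. 2];

        # Key changed: flush the group collected so far.
        if (defined $prev_key && $key ne $prev_key) {
            print {$out} join(',', $prev_key, @values), "\n";
            @values = ();
        }

        $prev_key = $key;
        push @values, $columns[4];
    }

    # Flush the final group, if the input was not empty.
    print {$out} join(',', $prev_key, @values), "\n" if defined $prev_key;
}

# Demo on the two sample lines from the question, read from an in-memory handle.
my $sample = "field1,field2,field3,metricA,value1\n"
           . "field1,field2,field3,metricB,value2\n";
open my $sample_fh, '<', \$sample or die "Cannot open in-memory handle: $!";
combine_adjacent($sample_fh, \*STDOUT);
```

For a real file, open it for reading and pass that handle in place of $sample_fh. Values are emitted in input order, so if the order within a key is not guaranteed, you would need to look at the metric name in the fourth column as well.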


+6

If Unix tools are available (for example, via cygwin), you can sort the file first with sort, so that lines sharing a key end up adjacent.

+5

One option is to tie the hash to an on-disk database, for example with BerkeleyDB:

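use strict;
use warnings;
use BerkeleyDB;

# Tie the hash to an on-disk database so the data does not have to fit in memory.
tie my %data, 'BerkeleyDB::Hash', -Filename => 'data', -Flags => DB_CREATE
    or die "Cannot open database 'data': $! $BerkeleyDB::Error\n";

while (my $line = <>) {
    chomp $line;
    my @columns = split /,/, $line; # or use Text::CSV_XS to parse this correctly

    my $key = join ',', @columns[0..2];
    my $a_key = "$key:metric_a";
    my $b_key = "$key:metric_b";

    # Metric names as they appear in the sample data from the question.
    if ($columns[3] eq 'metricA') {
        $data{$a_key} = $columns[4];
    }
    elsif ($columns[3] eq 'metricB') {
        $data{$b_key} = $columns[4];
    }

    # Once both halves of a pair have been seen, emit the combined line.
    if (exists $data{$a_key} && exists $data{$b_key}) {
        my ($value_a, $value_b) = map { $data{$_} } ($a_key, $b_key);
        print "$key,$value_a,$value_b\n";
        # optionally delete the data here, if you don't plan to reuse the database
    }
}

+5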

Wouldn't it be better to do another export directly from the database into your new file, rather than reprocessing the file you have already generated? If this is an option, I would go that route.

+3

You could try something with Sort::External. It reminds me of a mainframe sort that you can use directly in the program logic. It worked very well for what I used it for.

+2

Source: https://habr.com/ru/post/1703199/

