How to combine lines in a large, unsorted file without running out of memory in Perl?

Question


I have a very large comma-delimited file coming out of a database report, something like this:

field1,field2,field3,metricA,value1
field1,field2,field3,metricB,value2

I want the new file to have combined lines, something like this:

field1,field2,field3,value1,value2

I can do this using a hash. In this example, the first three fields are the key, and I combine value1 and value2 in a specific order as the value. After reading in the file, I just print the hash's keys and values to another file. It works great.
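For reference, the hash approach described above might look roughly like the following. This is only a sketch under assumptions: the sub name `combine_in_memory` is made up for illustration, and the metric labels `metricA` and `metricB` are taken from the sample lines. Every key is held in memory until the end, which is exactly the concern with an 8 GB input.

```perl
use strict;
use warnings;

# Hypothetical helper: read every line into a hash keyed on the first
# three fields, then print one combined line per key.
sub combine_in_memory {
    my ($in, $out) = @_;
    my %combined;

    while (my $line = <$in>) {
        chomp $line;
        my @columns = split /,/, $line;
        my $key = join ',', @columns[0 .. 2];
        # Store each value under its metric label (assumed: metricA / metricB).
        $combined{$key}{ $columns[3] } = $columns[4];
    }

    # Emit the values in a fixed order; a missing metric becomes an empty field.
    for my $key (sort keys %combined) {
        my $values = $combined{$key};
        print {$out} join(',', $key, $values->{metricA} // '', $values->{metricB} // ''), "\n";
    }
    return;
}
```

It could be called as `combine_in_memory(\*STDIN, \*STDOUT)`.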

However, I have some concerns, as my file will be very large: about 8 GB per file.

Would there be a more efficient way to do this? I'm not thinking in terms of speed, but in terms of memory footprint. I'm concerned that this process could die due to memory issues. I'm drawing a blank on a solution that will work without cramming everything into one very large hash.

For full disclosure, I use ActiveState Perl on Windows.

+3

file memory perl hash

geoffrobinson Feb 13 '09 at 20:08


5 answers

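If Unix tools are available (for example, through cygwin), you could sort the file first with sort (which can cope with very large files). Once the file is sorted, lines with the same key are adjacent and can be combined in a single pass.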

+5

Nick Fortescue Feb 13 '09 at 20:20

If the data will not fit in memory, you can tie the hash to an on-disk database:

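use strict;
use warnings;
use BerkeleyDB;

# Tie the hash to an on-disk database so it does not have to fit in memory.
# DB_CREATE creates the database file if it does not exist yet.
tie my %data, 'BerkeleyDB::Hash', -Filename => 'data', -Flags => DB_CREATE
    or die "Cannot open database: $BerkeleyDB::Error\n";

while (my $line = <>) {
    chomp $line;
    my @columns = split /,/, $line; # or use Text::CSV_XS to parse this correctly

    my $key = join ',', @columns[0..2];
    my $a_key = "$key:metric_a";
    my $b_key = "$key:metric_b";

    # The metric labels match the sample data in the question.
    if ($columns[3] eq 'metricA') {
        $data{$a_key} = $columns[4];
    }
    elsif ($columns[3] eq 'metricB') {
        $data{$b_key} = $columns[4];
    }

    # Once both values for a key have been seen, emit the combined line.
    if (exists $data{$a_key} && exists $data{$b_key}) {
        # Avoid naming these $a and $b, which are reserved for sort.
        my ($value_a, $value_b) = map { $data{$_} } ($a_key, $b_key);
        print "$key,$value_a,$value_b\n";
        # optionally delete the data here, if you don't plan to reuse the database
    }
}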

+5

jrockway Feb 14 '09 at 2:38

Wouldn't it be better to do another export directly from the database into your new file, rather than reprocessing the file you have already produced? If this is an option, I would go that route.

+3

Chris Ballance Feb 13 '09 at 20:13

You can try something with Sort::External. It reminds me of a mainframe sort that you can use directly in the program logic. It worked very well for what I used it for.
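A minimal sketch of how Sort::External might be used here, assuming its `new`, `feed`, `finish`, and `fetch` methods behave as described in the module's documentation. The sub name `external_sort` and the 16 MB `mem_threshold` are illustrative choices, not taken from the answer.

```perl
use strict;
use warnings;
use Sort::External;

# Hypothetical helper: sort the lines of $in lexically and write them to $out.
# Sort::External spills to temporary files once the buffered items exceed
# mem_threshold, so the whole input never has to sit in memory at once.
sub external_sort {
    my ($in, $out) = @_;
    my $sortex = Sort::External->new(mem_threshold => 16 * 1024**2);

    while (my $line = <$in>) {
        $sortex->feed($line);
    }
    $sortex->finish;    # must be called before fetching sorted items

    while (defined(my $line = $sortex->fetch)) {
        print {$out} $line;
    }
    return;
}
```

Because the key fields come first on each line, a plain lexical sort should put lines with the same key next to each other, provided the fields contain no embedded commas or quoting.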

+2

Axeman Feb 13 '09 at 22:30


Joel Hoffman · Accepted Answer · 2009-02-13T20:17:36+0000

If your rows are sorted by key, or for some other reason lines with equal values of field1, field2, field3 are adjacent, then a simple state machine will be much faster. Just read the lines, and if the fields match the previous line, emit both values.
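A rough sketch of such a state machine, assuming each key has one `metricA` line and one `metricB` line (labels taken from the question's sample) and that the two are adjacent in the input. The sub name `merge_adjacent` is made up for illustration. Only the values for the current key are kept, so memory use stays constant regardless of file size.

```perl
use strict;
use warnings;

# Hypothetical helper: combine adjacent lines that share the first three fields.
sub merge_adjacent {
    my ($in, $out) = @_;
    my $prev_key;
    my %values;    # metric label => value, for the current key only

    while (my $line = <$in>) {
        chomp $line;
        my @columns = split /,/, $line;
        my $key = join ',', @columns[0 .. 2];

        # Key changed: forget anything left over from an incomplete pair.
        %values = () if defined $prev_key && $key ne $prev_key;
        $prev_key = $key;

        $values{ $columns[3] } = $columns[4];

        # Both metrics seen for this key: emit them in a fixed order.
        if (exists $values{metricA} && exists $values{metricB}) {
            print {$out} "$key,$values{metricA},$values{metricB}\n";
            %values = ();
        }
    }
    return;
}
```

It could be called as `merge_adjacent(\*STDIN, \*STDOUT)` on input that has already been sorted. A key with only one of the two metrics is silently dropped in this sketch; whether that is acceptable depends on the data.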

