Perl: comparing two files and printing the lines that match and those that don't

For the Perl code below, I need to increase its efficiency, since it takes several hours to process the input files (containing millions of lines of data). Any ideas on how I can speed up the process?

Given two files, I want to compare the data and print those lines that match and those that don't. Note that you must map two columns.

For instance,

input1.txt

    A B
    C D

input2.txt

    B A
    C D
    E F
    G H

Please note: lines 1 and 2 of input2.txt have a counterpart in input1.txt (the two columns may appear in either order); lines 3 and 4 do not match.

Output:

    B A match
    C D match
    E F don't match
    G H don't match

Perl Code:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    open INFH1, "<input1.txt" || die "Error\n";
    open INFH2, "<input2.txt" || die "Error\n";
    chomp (my @array=<INFH2>);

    while (<INFH1>) {
        my @values = split;
        next if grep /\D/, @values or @values != 2;
        my $re = qr/\A$values[0]\s+$values[1]\z|\A$values[1]\s+$values[0]\z/;
        foreach my $temp (@array) {
            chomp $_;
            print "$_\n" if grep $_ =~ $re, $temp;
        }
    }
    close INFH1;
    close INFH2;
    1;

Any ideas on how to make this code more efficient are greatly appreciated. Thanks!

2 answers

If you have enough memory, use a hash. If no symbol occurs in more than one pair in input1.txt (i.e., if A B is in the file, A X is not), then the following should work:

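    #!/usr/bin/perl
    use warnings;
    use strict;

    my %hash;

    # Store every pair from input1.txt in both directions: A => B and B => A.
    open my $F1, '<', 'input1.txt' or die $!;
    while (<$F1>) {
        my @values = split;    # split on any whitespace; this also discards the newline
        @hash{@values} = reverse @values;
    }
    close $F1;

    open my $F2, '<', 'input2.txt' or die $!;
    while (<$F2>) {
        chomp;
        my @values = split;
        my $value = $hash{$values[0]};
        # defined() rather than a truth test, so that a partner of "0" still counts
        if (defined $value and $value eq $values[1]) {
            print "Matches: $_\n";
        }
        else {
            print "Does not match: $_\n";
        }
    }
    close $F2;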

Update:

For repeated values, I would use a hash of hashes. Sort the two symbols of each pair: the first becomes the key in the outer hash and the second becomes the key in the inner hash:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my %hash;

    open my $IN1, '<', 'input1.txt' or die $!;
    while (<$IN1>) {
        my @values = sort split;               # sorting makes "A B" and "B A" the same pair
        undef $hash{$values[0]}{$values[1]};   # only the existence of the key matters
    }
    close $IN1;

    open my $IN2, '<', 'input2.txt' or die $!;
    while (<$IN2>) {
        chomp;
        my @values = sort split;
        if (exists $hash{$values[0]}{$values[1]}) {
            print "$_ matches\n";
        }
        else {
            print "$_ doesn't match\n";
        }
    }
    close $IN2;
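
To try the hash-of-hashes idea on the sample data from the question without creating any files, the same logic can be wrapped in two small subroutines. This is only a sketch: the names `build_index` and `classify` are made up for illustration, and it assumes every line holds exactly two whitespace-separated symbols.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build_index and classify are hypothetical helper names used for this sketch.

# Build the two-level lookup: smaller symbol => larger symbol => 1.
sub build_index {
    my @lines = @_;
    my %index;
    for my $line (@lines) {
        my @pair = sort { $a cmp $b } split ' ', $line;
        next unless @pair == 2;    # assumption: silently skip malformed lines
        $index{ $pair[0] }{ $pair[1] } = 1;
    }
    return \%index;
}

# Return 1 if the pair on $line is in the index (in either order), else 0.
sub classify {
    my ($index, $line) = @_;
    my @pair = sort { $a cmp $b } split ' ', $line;
    return 0 unless @pair == 2;
    # Check the outer key first so a failed lookup does not create new entries.
    my $found = exists $index->{ $pair[0] }
             && exists $index->{ $pair[0] }{ $pair[1] };
    return $found ? 1 : 0;
}

my $index = build_index('A B', 'C D');
for my $line ('B A', 'C D', 'E F', 'G H') {
    my $verdict = classify($index, $line) ? 'matches' : "doesn't match";
    print "$line $verdict\n";
}
```

With the sample data, the first two lines are reported as matching and the last two as not matching.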

For those who are interested, here is another solution that does not depend on the number of columns:

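    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use 5.010;

    open INFH1, "<", "input1.txt" or die "Error\n";
    my @storage = map { [sort split] } <INFH1>;   # store content as matrix (each row sorted)
    close INFH1;

    open INFH2, "<", "input2.txt" or die "Error\n";
    while (<INFH2>) {
        chomp;
        # $. is 1-based, @storage is 0-based; input2.txt may have more rows than input1.txt
        my $row = $storage[$. - 1] // [];
        # The rows are compared as joined strings, which avoids the experimental
        # smartmatch operator (~~) for the elementwise comparison.
        if ("@$row" eq join ' ', sort split) {    # stored row equals current line (both sorted)
            say "$_ matches";
        }
        else {
            say "$_ doesn't match";
        }
    }
    close INFH2;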
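
The solution above compares the files row by row (row N of input1.txt against row N of input2.txt). If a line of input2.txt should instead be looked up anywhere in input1.txt, as in the hash-based answer, while still allowing any number of columns, the sorted row can be joined into a single hash key. A minimal sketch, assuming whitespace-separated fields; `row_key` and `load_keys` are illustrative names, and the demo reads from an in-memory string rather than the real files:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# row_key and load_keys are hypothetical helpers used only for this sketch.

# Normalise a line: split on whitespace, sort the fields, join with one space.
# Fields contain no whitespace after the split, so the joined key is unambiguous.
sub row_key {
    my ($line) = @_;
    return join ' ', sort { $a cmp $b } split ' ', $line;
}

# Read every line from a filehandle and return a hash ref of normalised keys.
sub load_keys {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        $seen{ row_key($line) } = 1;
    }
    return \%seen;
}

# In-memory stand-in for input1.txt; open the real file here for actual data.
my $data = "A B\nC D\nX Y Z\n";
open my $in1, '<', \$data or die "Cannot open in-memory data: $!";
my $seen = load_keys($in1);
close $in1;

for my $line ('B A', 'C D', 'Z X Y', 'E F') {
    my $verdict = exists $seen->{ row_key($line) } ? 'matches' : "doesn't match";
    print "$line $verdict\n";
}
```

Each line of either file is visited once, so the run time grows with the sum of the two file sizes rather than with their product, at the cost of keeping one key per line of the first file in memory.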

Source: https://habr.com/ru/post/1433768/

