Search indexes of non-empty fields in text files

Question

Search indexes of non-empty fields in text files

I have a very large text file whose lines are comma separated values. Some values are missing. For each row, I would like to print the index and value of all non-empty fields.

For example, a line might look like

,,10.3,,,,5.2,3.1,,,,,,,

in this case I want to choose

 2,10.3,6,5.2,7,3.1

I know how to do this by first splitting the input signal into an array and then going through an array with a for loop, but these are huge files (many gigabytes), and I am wondering if there is a faster way (for example, using some extended regular expression )

+4

regex perl parsing

eyio Feb 13 '12 at 21:00

source share

3 answers

flesk · Answer 1 · 2012-02-13T21:09:18+0000

I have not tested it yet, but I would suggest that

 my $line = ",,10.3,,,,5.2,3.1,,,,,,,"; my $index = 0; print join ",", map {join ",", @$_} grep $_->[1], map {[$index++, $_]} split ",", $line;

faster than some extended regular expression.

The problem is that while you need to know the index, you still have to keep track of those missing entries.

Something like this might not be too slow, though:

 my ($i, @vars); while ($line =~ s/^(,*)([^,]+)//) { push @vars, $i += length($1), $2; } print join ",", @vars;

You can probably leave the first capture group and use pos() to develop the index.

Here is a comparison of my two sentences and sin with 1M iterations:

  Rate flesk1 sin flesk2 flesk1 87336/s -- -8% -27% sin 94518/s 8% -- -21% flesk2 120337/s 38% 27% --

My regex seems to work better than I thought.

sln · Answer 2 · 2012-02-13T21:52:49+0000

You might be able to mix and match regex and code -

$line =~ /(?{($cnt,@ary)=(0,)})^(?:([^,]+)(?{push @ary,$cnt; push @ary,$^N})|,(?{$cnt++}))+/x
and print join( ',', @ary);

deployed -

 $line =~ / (?{($cnt,@ary)=(0,)}) ^(?: ([^,]+) (?{push @ary,$cnt; push @ary,$^N}) | , (?{$cnt++}) )+ /x and print join( ',', @ary);

some landmarks

With a little tweak to flesk and sln (look for fleskNew and slnNew)
the winner is fleskNew when the substitution operator is removed.

code -

 use Benchmark qw( cmpthese ) ; $samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p"; $line = $samp; cmpthese( -5, { flesk1 => sub{ $index = 0; join ",", map {join ",", @$_} grep $_->[1], map {[$index++, $_]} split ",", $line; }, flesk2 => sub{ ($i, @vars) = (0,); while ($line =~ s/^(,*)([^,]+)//) { push @vars, $i += length($1), $2; } $line = $samp; }, fleskNew => sub{ ($i, @vars) = (0,); while ($line =~ /(,*)([^,]+)/g) { push @vars, $i += length($1), $2; } }, sln1 => sub{ $line =~ / (?{($cnt,@ary)=(0,)}) ^(?: ([^,]+) (?{push @ary,$cnt; push @ary,$^N}) | , (?{$cnt++}) )+ /x }, slnNew => sub{ $line =~ / (?{($cnt,@ary)=(0,)}) (?: (,*) (?{$cnt += length($^N)}) ([^,]+) (?{push @ary, $cnt,$^N}) )+ /x }, } );

numbers -

  Rate flesk1 sln1 flesk2 slnNew fleskNew flesk1 20325/s -- -51% -52% -56% -60% sln1 41312/s 103% -- -1% -10% -19% flesk2 41916/s 106% 1% -- -9% -17% slnNew 45978/s 126% 11% 10% -- -9% fleskNew 50792/s 150% 23% 21% 10% --

some tests 2

Adds Birei built-in replication and cropping (all-in-one) solution.

Deviations:

Flesk1 has been modified to remove the final “join” because it is not included in the list of other regular expression solutions. This provides a better bench.

Birei deviates on the bench as he modifies the original row as a final decision.
This aspect cannot be removed. The difference between Birei1 and BireiNew is that the new one removes the final ','.

Flesk2, Birei1 and BireiNew have additional overhead for restoring the original line
due to the lookup operator.

The winner still looks like FleskNew ..

code -

 use Benchmark qw( cmpthese ) ; $samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p"; $line = $samp; cmpthese( -5, { flesk1a => sub{ $index = 0; map {join ",", @$_} grep $_->[1], map {[$index++, $_]} split ",", $line; }, flesk2 => sub{ ($i, @vars) = (0,); while ($line =~ s/^(,*)([^,]+)//) { push @vars, $i += length($1), $2; } $line = $samp; }, fleskNew => sub{ ($i, @vars) = (0,); while ($line =~ /(,*)([^,]+)/g) { push @vars, $i += length($1), $2; } }, sln1 => sub{ $line =~ / (?{($cnt,@ary)=(0,)}) ^(?: ([^,]+) (?{push @ary,$cnt; push @ary,$^N}) | , (?{$cnt++}) )+ /x }, slnNew => sub{ $line =~ / (?{($cnt,@ary)=(0,)}) (?: (,*) (?{$cnt += length($^N)}) ([^,]+) (?{push @ary, $cnt,$^N}) )+ /x }, Birei1 => sub{ $i = -1; $line =~ s/ (?(?=,+) ( (?: , (?{ ++$i }) )+ ) | (?<no_comma> [^,]+ ,? ) (?{ ++$i }) ) / defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[] /xge; $line = $samp; }, BireiNew => sub{ $i = 0; $line =~ s/ (?: , (?{++$i}) )* (?<data> [^,]* ) (?: ,*$ )? (?= (?<trailing_comma> ,?) ) / length $+{data} ? "$i,$+{data}$+{trailing_comma}" : "" /xeg; $line = $samp; }, } );

Results -

  Rate BireiNew Birei1 flesk1a flesk2 sln1 slnNew fleskNew BireiNew 6030/s -- -18% -74% -85% -86% -87% -88% Birei1 7389/s 23% -- -68% -82% -82% -84% -85% flesk1a 22931/s 280% 210% -- -44% -45% -51% -54% flesk2 40933/s 579% 454% 79% -- -2% -13% -17% sln1 41752/s 592% 465% 82% 2% -- -11% -16% slnNew 47088/s 681% 537% 105% 15% 13% -- -5% fleskNew 49563/s 722% 571% 116% 21% 19% 5% --

Birei · Answer 3 · 2012-02-15T11:15:55+0000

Using regex (although I'm sure it might be easier):

 s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;

Explanation:

 s/PATTERN/REPLACEMENT/ge # g -> Apply to all occurrences # e -> Evaluate replacement as a expression. (? (?=,+) # Check for one or more commas. ((?:,(?{ ++$i }))+) # If (?=,+) was true, increment variable '$i' with each comma found. | (?<no_comma>[^,]+,?)(?{ ++$i }) # If (?=,+) was false, get number between comma and increment the $i variable only once. ) defined $+{no_comma} # If 'no_comma' was set in 'pattern' expression... $i . qq[,] . $+{no_comma} # insert the position just before it. qq[] # If wasn't set, it means that pattern matched only commas, so remove then.

My test:

The contents of script.pl :

 use warnings; use strict; while ( <DATA> ) { our $i = -1; chomp; printf qq[Orig = $_\n]; s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge; # s/,\Z//; printf qq[Mod = $_\n\n]; } __DATA__ ,,10.3,,,,5.2,3.1,,,,,,, 10.3,,,,5.2,3.1,,,,,,, ,10.3,,,,5.2,3.1 ,,10.3,5.2,3.1,

Run the script as:

 perl script.pl

And the conclusion:

 Orig = ,,10.3,,,,5.2,3.1,,,,,,, Mod = 2,10.3,6,5.2,7,3.1, Orig = 10.3,,,,5.2,3.1,,,,,,, Mod = 0,10.3,4,5.2,5,3.1, Orig = ,10.3,,,,5.2,3.1 Mod = 1,10.3,5,5.2,6,3.1 Orig = ,,10.3,5.2,3.1, Mod = 2,10.3,3,5.2,4,3.1,

As you can see, it saves the last comma. I do not know how to remove it without additional regular expression, just uncomment s/,\Z//; in the previous code.

Search indexes of non-empty fields in text files

More articles: