Find all possible starting positions of a regular expression match in Perl, including overlapping matches?

Question

Is there a way to find all possible starting positions for a regular expression match in Perl?

For example, if your regular expression was "aa" and the text was "aaaa", it would return 0, 1, and 2 instead of, say, 0 and 2.

Obviously, you could return the first match, remove every character up to and including that match's starting character, and search again, but I am hoping for something more efficient.
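For reference, the fallback described above might look like the following minimal sketch. The name naive_positions is hypothetical and not part of the original post. The sketch also assumes the pattern has no anchors such as ^, which would behave differently once the front of the string is cut off.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): search the remainder of
# the string, then restart one character past the start of the last match.
sub naive_positions {
    my ($re, $str) = @_;
    my @pos;
    my $offset = 0;
    while ($offset <= length $str) {
        # Copying the tail on every pass is the inefficiency in question.
        my $tail = substr $str, $offset;
        last unless $tail =~ /(?:$re)/;
        my $start = $offset + $-[0];   # translate back to original coordinates
        push @pos, $start;
        $offset = $start + 1;          # resume just after this match's start
    }
    return @pos;
}

print join(', ', naive_positions('aa', 'aaaa')), "\n";   # prints "0, 1, 2"
```

Each pass copies the rest of the string, so the cost grows with both the string length and the number of matches.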

+4

perl

jonderry Jan 13 '11 at 22:11

4 answers

Update:

I thought about this a little more and came up with this solution using an embedded code block, (?{...}), which is almost three times faster than the grep solution:

    use 5.010;
    use warnings;
    use strict;

    {
        my @pos;
        my $push_pos = qr/(?{push @pos, $-[0]})/;

        sub with_code {
            my ($re, $str) = @_;
            @pos = ();
            $str =~ /(?:$re)$push_pos(?!)/;
            @pos
        }
    }

and for comparison:

    sub with_grep {  # old solution
        my ($re, $str) = @_;
        grep {pos($str) = $_; $str =~ /\G(?:$re)/} 0 .. length($str) - 1;
    }

    sub with_while {  # per Michael Carman's solution, corrected
        my ($re, $str) = @_;
        my @pos;
        while ($str =~ /\G.*?($re)/) {
            push @pos, $-[1];
            pos $str = $-[1] + 1
        }
        @pos
    }

    sub with_look_ahead {  # a fragile "generic" version of Sean's solution
        my ($re, $str) = @_;
        my ($re_a, $re_b) = split //, $re, 2;
        my @pos;
        push @pos, $-[0] while $str =~ /$re_a(?=$re_b)/g;
        @pos
    }

Tested and verified with:

    use Benchmark 'cmpthese';

    my @arg    = qw(aa aaaabbbbbbbaaabbbbbaaa);
    my $expect = 7;

    # qw() needs enclosing parentheses here on Perl 5.18 and later
    for my $sub (qw(grep while code look_ahead)) {
        no strict 'refs';
        my @got = &{"with_$sub"}(@arg);
        "@got" eq '0 1 2 11 12 19 20' or die "$sub: @got";
    }

    cmpthese -2 => {
        grep  => sub {with_grep      (@arg) == $expect or die},
        while => sub {with_while     (@arg) == $expect or die},
        code  => sub {with_code      (@arg) == $expect or die},
        ahead => sub {with_look_ahead(@arg) == $expect or die},
    };

Which prints:

               Rate  grep while ahead  code
    grep    49337/s    --  -20%  -43%  -65%
    while   61293/s   24%    --  -29%  -56%
    ahead   86340/s   75%   41%    --  -38%
    code   139161/s  182%  127%   61%    --

+1

Eric Strom Jan 13 '11 at 10:31

I know you asked for a regex, but there is a simple built-in function that does something very similar: index (perldoc -f index). It gives a simple solution to your literal question. Note that index only searches for fixed substrings, starting at the position given by its third argument, so it will not work if you need a more complex search than your example.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $str    = 'aaaa';
    my $substr = 'aa';
    my $pos    = -1;

    while (1) {
        $pos = index($str, $substr, $pos + 1);
        last if $pos < 0;
        print $pos . "\n";
    }

+1

Joel Berger Jan 14 '11 at 3:00

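You can use global matching with the pos() function. Move pos() back to one character past the start of each match, otherwise overlapping matches are skipped:

    my $s1 = "aaaa";
    my $s2 = "aa";

    while ($s1 =~ /\Q$s2\E/g) {
        my $start = pos($s1) - length($s2);
        print $start, "\n";
        pos($s1) = $start + 1;   # resume just after this match's start
    }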

0

Eugene Yarmash Jan 13 '11 at 22:28

Sean · Accepted Answer · 2011-01-13T22:27:51+0000

Use lookahead:

$ perl -le 'print $-[0] while "aaaa" =~ /a(?=a)/g'

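In general, put everything except the first character of the regular expression inside (?=...). The lookahead does not consume what it matches, so the next search resumes one character after the previous match started.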
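When the pattern is not a plain literal, splitting off its first character is awkward. A possible variant, offered only as a sketch and not taken from the answer itself, wraps the whole pattern in the lookahead. The name lookahead_positions is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answers).
sub lookahead_positions {
    my ($re, $str) = @_;
    my @pos;
    # The lookahead consumes nothing, so every match is zero-length and
    # /g steps forward one character after each one.
    push @pos, $-[0] while $str =~ /(?=$re)/g;
    return @pos;
}

print join(', ', lookahead_positions('aa', 'aaaa')), "\n";   # prints "0, 1, 2"
```

Because the pattern sits inside (?=...), alternations and quantifiers in $re need no extra grouping, and each starting position is reported once.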
