How to determine how many capture groups are in Perl Regexp?

Question

I have a bunch of Perl regexps in a script, and I would like to know how many capture groups each of them contains. More precisely, I would like to know how many elements a successful match will put into the @- and @+ arrays, before using the regexps in a real match.

Example:

'XXAB(CD)DE\FG\XX' =~ /(?i)x(ab)\(cd\)(?:de)\\(fg\\)x/ and print "'@-', '@+'\n";

In this case, the output is:

 '1 2 11', '15 4 14'

So, after matching, I know that the 0th element describes the matched part of the string and that there are two capture groups. Is it possible to know this before the actual match?

I tried to focus on the opening brackets. First I removed the "\\" sequences to make escaped brackets easier to detect. Then I deleted the '\(' strings, and then the '(?' strings. Now I can count the remaining opening brackets.

    my $re = '(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x';
    print "ORIG: '$re'\n";
    'XXAB(CD)DE\FG\XX' =~ /$re/ and print "RE: '@-', '@+'\n";
    $re =~ s/\\\\//g;
    print "\\\\: '$re'\n";
    $re =~ s/\\\(//g;
    print "\\(: '$re'\n";
    $re =~ s/\(\?//g;
    print "\\?: '$re'\n";
    my $n = ($re =~ s/\(//g);
    print "n=$n\n";

Output:

    ORIG: '(?i)x(ab)\(cd\)(?:de)\\(fg\\)x'
    RE: '1 2 11', '15 4 14'
    \\: '(?i)x(ab)\(cd\)(?:de)(fg)x'
    \(: '(?i)x(ab)cd\)(?:de)(fg)x'
    \?: 'i)x(ab)cd\):de)(fg)x'
    n=2

So I know that there are 2 capture groups in this regexp. But there may be a simpler way, and this one is definitely not complete (for example, it treats (?<foo>...) and (?'foo'...) as non-capturing groups).

Another way would be to dump the internal data structures of the regcomp function. Perhaps the Regexp::Debugger package could solve the problem, but I do not have the rights to install packages in my environment.

In fact, the regexps are keys that map to ARRAY refs, and I would like to check that each ARRAY ref contains the correct number of values before applying the regexps. Of course, this check can be performed right after a pattern matches, but it would be better if I could do it when the script is loaded.

Thank you for your help and comments!

+6

regex perl

Truey Jan 19 '17 at 13:50

3 answers

Regex:

 \\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))

Explanation:

    \\.                  # Match any escaped character
    (*SKIP)(?!)          # Discard it
    |                    # OR
    \(                   # Match a single `(`
    (?(?=\?)             # Which if is followed by `?`
        \?               # Match `?`
        P?['<]\w+['>]    # Next characters should be matched as ?P'name', ?<name> or ?'name'
    )                    # End of conditional statement

Perl:

    my @offsets = ();
    while ('XXAB(CD)DE\FG\X(X)' =~ /\\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))/g) {
        push @offsets, "$-[0]";
    }
    print join(", ", @offsets);

Output:

 4, 15

This indicates that there are two capture groups in the input string.
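To count the groups of a pattern, this regex would be applied to the pattern's own text rather than to a subject string. The following is a minimal sketch of that idea, reusing the pattern from the question; `count_open_groups` is a hypothetical helper name, and the sketch makes no attempt to handle parentheses inside character classes or comments.

```perl
use strict;
use warnings;

# Hypothetical helper: counts capturing "(" in the text of a pattern by
# applying the regex from this answer to the pattern string itself.
sub count_open_groups {
    my ($pattern) = @_;
    my $count = 0;
    $count++ while $pattern =~ /\\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))/g;
    return $count;
}

# The single-quoted literal holds: (?i)x(ab)\(cd\)(?:de)\\(fg\\)x
my $pattern = '(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x';
print count_open_groups($pattern), "\n";   # prints 2
```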

+1

revo Jan 19 '17 at 22:00

Without any restrictions on the form of the regular expressions, I think there is no definitive answer to the number of capture groups. Just think of alternatives with a different number of capture groups, and of the possibility of nesting such alternatives within each branch:

    my $re = qr/ A(B)C | A(D|(E(G+|H)))F /x;

Up to three capture groups can take part in a match of this regular expression. You could recursively analyze each branch and take the highest number as the result, but I honestly can't think of a practical way to do this in a short time. For "linear" regular expressions that do not use alternatives or other advanced regex features, determining the number of capture groups is feasible, but I do not think it is possible for more advanced ones.
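For comparison, here is a small check of what Perl itself reports after a match against such a pattern. It is a sketch that assumes the parentheses are balanced as `A(D|(E(G+|H)))F`, which is a guess at the intended pattern. According to perlvar, `$#+` is the number of groups in the last successfully matched pattern, while `$#-` is the last group that actually took part in the match.

```perl
use strict;
use warnings;

# Assumed balanced form of the pattern discussed above.
my $re = qr/ A(B)C | A(D|(E(G+|H)))F /x;

if ('ABC' =~ $re) {
    # Groups are numbered left to right across both branches.
    print "groups in the pattern:    $#+\n";   # 4
    print "last group that matched:  $#-\n";   # 1
}
```

So @+ gets one slot per group in the whole pattern regardless of which branch matched, whereas @- only extends to the last group that participated.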

+1

SREagle Jan 24 '17 at 10:56

Truey · Accepted Answer · 2017-03-01T23:16:07+0000

As Mr. Obama said: "Yes, we can!"

I found a solution that does not require an additional module and handles all possible kinds of capture groups (as far as I know). As ikegami mentioned, this requires parsing the regular expression, but Perl does that for us.

While digging through the haystack of Perl modules on CPAN, I found a very interesting one: warnings::regex::recompile. It generates a warning message every time a regexp is recompiled. Analyzing its source, I found a solution to my problem.

With use re qw/Debug DUMP/; Perl dumps the compiled regular expression to STDERR. In the module's source, the output is written to a real file and then read back for processing. I changed the code to use an in-memory file instead.

My solution:

    sub dumpre {
        use re qw(eval Debug DUMP);
        my $buf = '';
        open OLDERR, '>&', STDERR or die "$!";
        close STDERR or die "$!";
        open STDERR, '>', \$buf or die "$!";
        my $re = qr/$_[0]/;
        close STDERR or die "$!";
        open STDERR, '>&', OLDERR or die "$!";
        close OLDERR or die "$!";
        no re 'debug'; # Needed because of split
        return [ split '\n', $buf ];
    }

This function enables DUMP while the regular expression is compiled. The eval option allows patterns that contain (?{...}) and (??{...}) to be compiled.

    my $re = 'aa(?:(a\d)+x)?((b\d)*d)*c*(d\d)?(e*)((f)+)(g)+';
    my $r = dumpre $re;
    print join "\n", @$r;

Result:

    Compiling REx "aa(?:(a\d)+x)?((b\d)*d)*c*(d\d)?(e*)((f)+)(g)+"
    Final program:
    1: EXACT <aa> (3)
    3: CURLYX[0] {0,1} (19)
    5: CURLYM[1] {1,32767} (16)
    9: EXACT <a> (11)
    11: POSIXU[\d] (14)
    14: SUCCEED (0)
    15: NOTHING (16)
    16: EXACT <x> (18)
    18: WHILEM (0)
    19: NOTHING (20)
    20: CURLYX[1] {0,32767} (40)
    22: OPEN2 (24)
    24: CURLYM[3] {0,32767} (35)
    28: EXACT <b> (30)
    30: POSIXU[\d] (33)
    33: SUCCEED (0)
    34: NOTHING (35)
    35: EXACT <d> (37)
    37: CLOSE2 (39)
    39: WHILEM[1/7] (0)
    40: NOTHING (41)
    41: STAR (44)
    42: EXACT <c> (0)
    44: CURLYM[4] {0,1} (55)
    48: EXACT <d> (50)
    50: POSIXU[\d] (53)
    53: SUCCEED (0)
    54: NOTHING (55)
    55: OPEN5 (57)
    57: STAR (60)
    58: EXACT <e> (0)
    60: CLOSE5 (62)
    62: OPEN6 (64)
    64: CURLYN[7] {1,32767} (74)
    66: NOTHING (68)
    68: EXACT <f> (0)
    72: WHILEM (0)
    73: NOTHING (74)
    74: CLOSE6 (76)
    76: CURLYN[8] {1,32767} (86)
    78: NOTHING (80)
    80: EXACT <g> (0)
    84: WHILEM (0)
    85: NOTHING (86)
    86: END (0)
    anchored "aa" at 0 floating "fg" at 2..9223372036854775807 (checking floating) minlen 4

Thus, the lines with OPEN\d+, CURLYM[\d+] and CURLYN[\d+] mark capturing bracket expressions (line syntax: segment_no: regex command (next segment)). (Note: CURLYX is not a capturing bracket expression; it corresponds to a quantified non-capturing group such as (?:...)+.) The number after OPEN or CURLY[MN] is the sequence number of the capture group. The highest number found is the number of capture groups. In this case, it is 8.
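Extracting that highest number from the dump could look roughly like the sketch below. It runs on a few lines copied from the output above instead of calling dumpre, `count_groups_in_dump` is a hypothetical helper name, and the line format it expects is simply the one shown in this dump.

```perl
use strict;
use warnings;

# A few lines copied from the dump above; in real use the input would be
# the array reference returned by dumpre().
my @dump = (
    '1: EXACT <aa> (3)',
    '3: CURLYX[0] {0,1} (19)',
    '5: CURLYM[1] {1,32767} (16)',
    '22: OPEN2 (24)',
    '37: CLOSE2 (39)',
    '64: CURLYN[7] {1,32767} (74)',
    '76: CURLYN[8] {1,32767} (86)',
    '86: END (0)',
);

# Hypothetical helper: returns the highest group number seen in the dump.
sub count_groups_in_dump {
    my @lines = @_;
    my $max = 0;
    for my $line (@lines) {
        # Only OPENn, CURLYM[n] and CURLYN[n] carry a capture group number;
        # CURLYX[n] and CLOSEn are skipped on purpose.
        next unless $line =~ /^\s*\d+:\s*(?:OPEN(\d+)|CURLY[MN]\[(\d+)\])/;
        my $n = defined $1 ? $1 : $2;
        $max = $n if $n > $max;
    }
    return $max;
}

print count_groups_in_dump(@dump), "\n";   # prints 8
```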

Unfortunately, this does not cover the case where (??{...}) returns an expression that itself contains capturing brackets, but that is not very important for me right now. I assume that the dump format is not fixed, so it may differ from one Perl version to another. That is acceptable for me.
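For comparison, here is a different technique that does not depend on the dump format. It is not the method used in this answer. It relies on the behaviour documented in perlvar that `$#+` holds the number of groups in the last successfully matched pattern: the pattern is prefixed with an empty alternative and matched against the empty string. The names `count_groups` and `%values_for` are hypothetical, and groups produced at run time by (??{...}) are not counted here either.

```perl
use strict;
use warnings;

# Hypothetical helper: number of capture groups in a pattern string.
sub count_groups {
    my ($pattern) = @_;
    # The empty first alternative matches the empty string, so the match
    # succeeds without the real pattern being tried against any input.
    '' =~ /|(?:$pattern)/ or die "unexpected match failure";
    return $#+;
}

print count_groups('(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x'), "\n";              # 2
print count_groups('aa(?:(a\d)+x)?((b\d)*d)*c*(d\d)?(e*)((f)+)(g)+'), "\n";  # 8

# Load-time check in the spirit of the question: each pattern maps to an
# array ref that should hold one value per capture group.
my %values_for = (
    'x(ab)(cd)'         => [ 'first', 'second' ],
    '(\d+)-(\d+)-(\d+)' => [ 'year', 'month' ],   # deliberately one short
);
for my $pattern (sort keys %values_for) {
    my $want = count_groups($pattern);
    my $have = @{ $values_for{$pattern} };
    print "$pattern: expected $want values, found $have\n" if $want != $have;
}
```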
