How to determine how many capture groups are in a Perl regexp?

I have a bunch of regexps in a script. I would like to know how many capture groups they contain. More precisely, I would like to know how many elements will be added to the @- and @+ arrays if they match, before using them in a real match.

Example:

 'XXAB(CD)DE\FG\XX' =~ /(?i)x(ab)\(cd\)(?:de)\\(fg\\)x/ and print "'@-', '@+'\n";

In this case, the output is:

 '1 2 11', '15 4 14' 

So, after matching, I know that the 0th element is the matched part of the string and that there are two capture groups. Is it possible to know this before the actual match?

I tried concentrating on the opening parentheses. First I removed the "\\" sequences to make escaped parentheses easier to detect. Then I removed the '\(' strings, then the '(?' strings. Now I can count the remaining opening parentheses.

 my $re = '(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x';
 print "ORIG: '$re'\n";
 'XXAB(CD)DE\FG\XX' =~ /$re/ and print "RE: '@-', '@+'\n";
 $re =~ s/\\\\//g;
 print "\\\\: '$re'\n";
 $re =~ s/\\\(//g;
 print "\\(: '$re'\n";
 $re =~ s/\(\?//g;
 print "\\?: '$re'\n";
 my $n = ($re =~ s/\(//g);
 print "n=$n\n";

Output:

 ORIG: '(?i)x(ab)\(cd\)(?:de)\\(fg\\)x'
 RE: '1 2 11', '15 4 14'
 \\: '(?i)x(ab)\(cd\)(?:de)(fg)x'
 \(: '(?i)x(ab)cd\)(?:de)(fg)x'
 \?: 'i)x(ab)cd\):de)(fg)x'
 n=2

So, I know that there are 2 capture groups in this regexp. But there may be a simpler way, and this one is definitely not complete (for example, it treats (?<foo>...) and (?'foo'...) as non-capturing groups).

Another way would be to dump the internal data structures of the regcomp function. Perhaps the Regexp::Debugger package could solve the problem, but I do not have the right to install packages in my environment.

In fact, the regexps are keys to some ARRAY refs, and I would like to check whether each referenced ARRAY contains the correct number of values before applying the regexps. Of course, this check can be performed right after the pattern matches, but it would be nicer if I could do it at script load time.

Thank you for your help and comments!

+6
3 answers

As Mr. Obama said: "Yes, we can!"

I found a solution that does not require an additional module and handles all possible kinds of capture groups (as far as I know). As Ikegami mentioned, this requires parsing the regular expression, but Perl does that for us.

While digging through the haystack of Perl modules on CPAN, I found a very interesting one: warnings::regex::recompile. It emits a warning every time a regexp is recompiled. Reading its source, I found a solution to my problem.

With use re qw/Debug DUMP/; Perl dumps the compiled regular expression to STDERR. In the original module, the dump is written to a real file and then read back for processing. I changed the code to use an in-memory file instead.

My solution:

 sub dumpre {
     use re qw(eval Debug DUMP);
     my $buf = '';
     # Save the real STDERR, then reopen STDERR onto an in-memory buffer
     open OLDERR, '>&', STDERR or die "$!";
     close STDERR or die "$!";
     open STDERR, '>', \$buf or die "$!";
     # Compiling the pattern writes the dump into $buf
     my $re = qr/$_[0]/;
     # Restore the real STDERR
     close STDERR or die "$!";
     open STDERR, '>&', OLDERR or die "$!";
     close OLDERR or die "$!";
     no re 'debug'; # Needed because of split
     return [ split '\n', $buf ];
 }

This function enables DUMP while compiling a regular expression. The eval flag lets it process (?{...}) and (??{...}) expressions.

 my $re = 'aa(?:(a\d)+x)?((b\d)*d)*c*(d\d)?(e*)((f)+)(g)+';
 my $r = dumpre $re;
 print join "\n", @$r;

Result:

 Compiling REx "aa(?:(a\d)+x)?((b\d)*d)*c*(d\d)?(e*)((f)+)(g)+"
 Final program:
 1: EXACT <aa> (3)
 3: CURLYX[0] {0,1} (19)
 5: CURLYM[1] {1,32767} (16)
 9: EXACT <a> (11)
 11: POSIXU[\d] (14)
 14: SUCCEED (0)
 15: NOTHING (16)
 16: EXACT <x> (18)
 18: WHILEM (0)
 19: NOTHING (20)
 20: CURLYX[1] {0,32767} (40)
 22: OPEN2 (24)
 24: CURLYM[3] {0,32767} (35)
 28: EXACT <b> (30)
 30: POSIXU[\d] (33)
 33: SUCCEED (0)
 34: NOTHING (35)
 35: EXACT <d> (37)
 37: CLOSE2 (39)
 39: WHILEM[1/7] (0)
 40: NOTHING (41)
 41: STAR (44)
 42: EXACT <c> (0)
 44: CURLYM[4] {0,1} (55)
 48: EXACT <d> (50)
 50: POSIXU[\d] (53)
 53: SUCCEED (0)
 54: NOTHING (55)
 55: OPEN5 (57)
 57: STAR (60)
 58: EXACT <e> (0)
 60: CLOSE5 (62)
 62: OPEN6 (64)
 64: CURLYN[7] {1,32767} (74)
 66: NOTHING (68)
 68: EXACT <f> (0)
 72: WHILEM (0)
 73: NOTHING (74)
 74: CLOSE6 (76)
 76: CURLYN[8] {1,32767} (86)
 78: NOTHING (80)
 80: EXACT <g> (0)
 84: WHILEM (0)
 85: NOTHING (86)
 86: END (0)
 anchored "aa" at 0 floating "fg" at 2..9223372036854775807 (checking floating) minlen 4

Thus, lines with OPEN\d+, CURLYM[\d+], and CURLYN[\d+] indicate capturing parenthesized expressions (line syntax: segment_no: regex command (next segment)). (Note: CURLYX is not a capturing parenthesized expression; it corresponds to something like (?:...)+.) The number after OPEN or CURLY[MN] is the sequence number of the capture group. The highest number found is the number of capture groups; in this case, it is 8.

Unfortunately, it does not handle the case where (??{...}) returns an expression containing capturing parentheses, but that is not very important for me right now. I assume the dump format is not fixed, so it may differ from version to version. But that is fine for me.
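The last step described above (scanning the dump for the highest group number) might look like the following. This is only a sketch: the name max_group_from_dump is made up for illustration, it expects the array ref returned by dumpre, and it assumes the dump format shown above, which may differ between Perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): returns the highest
# capture-group number found in the dump lines, or 0 if there is none.
sub max_group_from_dump {
    my ($lines) = @_;
    my $max = 0;
    for my $line (@$lines) {
        # Matches nodes such as "22: OPEN2 (24)" or "5: CURLYM[1] {1,32767} (16)".
        # CURLYX[n] is deliberately not matched, since it does not capture.
        next unless $line =~ /^\s*\d+:\s*(?:OPEN|CURLY[MN]\[)(\d+)/;
        $max = $1 if $1 > $max;
    }
    return $max;
}

# A few lines taken from the dump shown above
my @sample = (
    '3: CURLYX[0] {0,1} (19)',
    '5: CURLYM[1] {1,32767} (16)',
    '20: CURLYX[1] {0,32767} (40)',
    '22: OPEN2 (24)',
    '64: CURLYN[7] {1,32767} (74)',
    '76: CURLYN[8] {1,32767} (86)',
);
print max_group_from_dump(\@sample), "\n";   # prints 8
```

With the real function it would be called as max_group_from_dump(dumpre($re)).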

0

Regex:

 \\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>])) 

Explanation:

 \\.                 # Match any escaped character
 (*SKIP)(?!)         # Discard it
 |                   # OR
 \(                  # Match a single `(`
 (?(?=\?)            # If it is followed by `?`
   \?                # Match `?`
   P?['<]\w+['>]     # Next characters should be matched as ?P'name', ?<name> or ?'name'
 )                   # End of conditional statement

Perl:

 my @offsets = ();
 while ('XXAB(CD)DE\FG\X(X)' =~ /\\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))/g) {
     push @offsets, "$-[0]";
 }
 print join(", ", @offsets);

Output:

 4, 15 

This indicates the presence of two capture groups in the input string.
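To count groups in a regexp rather than in a sample string, the same pattern can be applied to the regexp's source text. The sketch below wraps it in a function; the name count_capturing_parens is hypothetical, and the inner group is made non-capturing so that a list-context match returns exactly one element per hit. It shares the limits of the pattern itself: it does not understand an opening parenthesis inside a character class such as [(], comments under /x, or branch reset groups.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): counts the opening
# parentheses in a regexp *source string* that start a capture group.
sub count_capturing_parens {
    my ($source) = @_;
    # First alternative: skip every escaped character.
    # Second alternative: a "(" that is plain or opens a named group.
    my $count = () = $source =~ /\\.(*SKIP)(?!)|\((?(?=\?)\?(?:P?['<]\w+['>]))/gs;
    return $count;
}

# The regexp from the question, written as a single-quoted string
print count_capturing_parens('(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x'), "\n";   # prints 2
# A named group followed by a numbered one
print count_capturing_parens('(?<year>\d+)-(\d+)'), "\n";                   # prints 2
```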

+1

Without any restrictions on the regular expressions that may occur, I think there is no definitive answer to the number of capture groups. Just think of alternations with a different number of capture groups in each branch, and the possibility of this recurring within each branch:

 my $re = qr/ A(B)C | A(D|(E(G+|H))F) /x;

There can be up to three capture groups taking part in a match of this regular expression. You could recursively analyze each branch and take the highest number as the result, but I honestly can't think of a practical way to do this in a short time. For "linear" regular expressions that do not use alternation or more exotic regex features, determining the number of capture groups is feasible, but I do not think it is feasible for more advanced ones.
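For comparison, Perl numbers capture groups by their opening parenthesis across all branches, and perlvar documents $#+ as the number of groups in the last successfully matched pattern. That suggests a check that needs no parsing at all: match the empty string against "nothing, or the pattern" and read $#+. The sketch below is not taken from this thread, and the name count_groups is hypothetical. It uses a balanced variant of the alternation above, and interpolated patterns containing (?{...}) would additionally need use re 'eval'.

```perl
use strict;
use warnings;

# Hypothetical helper: number of capture groups in a pattern (string or qr//),
# found by matching the empty string against "nothing OR the pattern".
sub count_groups {
    my ($pattern) = @_;
    '' =~ /|(?:$pattern)/ or return;   # the empty alternative always matches
    return $#+;                        # index of the last group in the pattern
}

print count_groups(qr/ A(B)C | A(D|(E(G+|H))F) /x), "\n";         # prints 4
print count_groups('(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x'), "\n";   # prints 2
```

Only some of the four groups take part in any single match, but @+ always has one slot per group, which is what the question asks about.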

+1

Source: https://habr.com/ru/post/1014224/

