A regular expression that uses balancing groups

I have a basic text template engine that uses syntax like this:

foo bar %IF MY_VAR some text %IF OTHER_VAR some other text %ENDIF %ENDIF bar foo 

My problem is that the regex I use to parse it does not handle nested IF / ENDIF blocks.

Current regular expression: %IF (?<Name>[\w_]+)(?<Contents>.*?)%ENDIF

I read about balancing groups (a feature of the .NET regex library), which, as I understand it, are the recommended way to support recursive regular expressions in .NET.

I played with balancing groups and came up with the following:

 ( ( (?'Open'%IF\s(?<Name>[\w_]+)) (?<Contents>.*?) )+ ( (?'Close-Open'%ENDIF)(?<Remainder>.*?) )+ )* (?(Open)(?!)) 

But this does not behave as I would expect. For example, it captures many empty groups. Help?

1 answer

To capture a whole IF / ENDIF block with its nested IFs balanced, you can use this regex (it is written with whitespace and comments, so it needs RegexOptions.IgnorePatternWhitespace):

    %IF\s+(?<Name>\w+)
    (?<Contents>
        (?> #Possessive group, so . will not match IF/ENDIF
            \s|
            (?<IF>%IF)|     #for IF, push
            (?<-IF>%ENDIF)| #for ENDIF, pop
            . # or, anything else, but don't allow
        )+
        (?(IF)(?!)) #fail on extra open IFs
    )   #/Contents
    %ENDIF

The point here is this: within a single Match, each named group yields only one value. You will get only one (?<Name>\w+) group, for example, and it holds the last captured value. In my regex I kept the Name and Contents groups of your simple regex and limited the balancing to the inside of the Contents group, so the regex is still wrapped in IF and ENDIF.
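For readers less familiar with balancing groups, the bookkeeping that the IF group performs can be mimicked with an explicit counter. The following is a minimal Python sketch, not .NET code and not part of the original answer: the names TOKEN and top_level_blocks are hypothetical, the input is assumed to be well formed, and %IF is only recognized when a name follows it.

```python
import re

# Hypothetical helper for illustration only. It imitates the push/pop
# behaviour of the IF balancing group with a plain depth counter.
TOKEN = re.compile(r"%IF\s+(?P<name>\w+)|%ENDIF")

def top_level_blocks(text):
    """Return (name, contents) for each outermost %IF ... %ENDIF block."""
    blocks = []
    depth = 0      # number of %IF tokens currently open
    name = None    # name of the outermost open block
    start = 0      # index where that block's contents begin
    for tok in TOKEN.finditer(text):
        if tok.group("name") is not None:
            # %IF: "push"
            if depth == 0:
                name, start = tok.group("name"), tok.end()
            depth += 1
        elif depth > 0:
            # %ENDIF: "pop"; the block is complete once depth is back to 0
            depth -= 1
            if depth == 0:
                blocks.append((name, text[start:tok.start()]))
        # An %ENDIF with nothing open is ignored in this sketch.
    return blocks

sample = ("foo bar %IF MY_VAR some text %IF OTHER_VAR some other text "
          "%ENDIF %ENDIF bar foo")
blocks = top_level_blocks(sample)
```

On the sample from the question this yields a single block named MY_VAR whose contents still include the inner %IF OTHER_VAR ... %ENDIF text, which is the same shape the Name and Contents groups give.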

It gets interesting when your data is more complex. For example:

 %IF MY_VAR some text %IF OTHER_VAR some other text %ENDIF %IF OTHER_VAR2 some other text 2 %ENDIF %ENDIF %IF OTHER_VAR3 some other text 3 %ENDIF 

You will get two matches here: one for MY_VAR and one for OTHER_VAR3. If you want to capture the two IFs inside the contents of MY_VAR, you have to re-run the regex on its Contents group. (You can get around this with a lookahead if you must, by wrapping the whole regex in (?=...), but you will then have to rebuild the logical structure yourself from the positions and lengths of the matches.)
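If a regex is not a hard requirement, the nesting can also be recovered in one pass instead of re-running a pattern on each Contents value. The sketch below is in Python rather than .NET and swaps balancing groups for an explicit stack; parse_blocks, TOKEN and the dictionary layout are assumptions made for illustration, and unbalanced input is simply dropped rather than reported.

```python
import re

# Hypothetical names throughout; this is an explicit-stack alternative,
# not the balancing-group regex from the answer.
TOKEN = re.compile(r"%IF\s+(?P<name>\w+)|%ENDIF")

def parse_blocks(text):
    """Return a tree of blocks: each node has name, contents, children."""
    root = []
    stack = []  # open blocks as (name, contents_start, children)
    for tok in TOKEN.finditer(text):
        if tok.group("name") is not None:
            # %IF opens a block: push it
            stack.append((tok.group("name"), tok.end(), []))
        elif stack:
            # %ENDIF closes the most recently opened block: pop it
            name, start, children = stack.pop()
            node = {
                "name": name,
                "contents": text[start:tok.start()],
                "children": children,
            }
            # Attach to the enclosing block, or to the top level
            parent = stack[-1][2] if stack else root
            parent.append(node)
        # A stray %ENDIF is ignored; blocks never closed are discarded.
    return root

sample = ("%IF MY_VAR some text %IF OTHER_VAR some other text %ENDIF "
          "%IF OTHER_VAR2 some other text 2 %ENDIF %ENDIF "
          "%IF OTHER_VAR3 some other text 3 %ENDIF")
tree = parse_blocks(sample)
```

For the sample above, the top level holds MY_VAR and OTHER_VAR3, and MY_VAR carries OTHER_VAR and OTHER_VAR2 as children, so no positions or lengths need to be reconciled afterwards.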

I won't explain too much, because it seems you already have the basics, but a short note about the Contents group: I use a possessive (atomic) group, (?> ), to avoid backtracking. Otherwise the dot could eventually match whole IFs and break the balance. A lazy match on the group would behave similarly ( ( )+? instead of (?> )+ ).


Source: https://habr.com/ru/post/1484664/