How to handle nested parentheses with regular expressions?

I came up with a regex that splits text into three categories:

  • in parentheses
  • in brackets
  • neither.

Like this:

\[.+?\]|\(.+?\)|[\w+ ?]+ 

I only care about the outermost delimiters. So, given a(b[c]d)e, the split should be:

 a || (b[c]d) || e 

It works fine with brackets inside parentheses or parentheses inside brackets, but it breaks down when brackets are nested inside brackets or parentheses inside parentheses. For example, a[b[c]d]e is split as

 a || [b[c] || d || ] || e. 
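The behaviour described above can be reproduced with a short sketch. It assumes the pattern is applied with Python's `re.findall`, which the question does not state; the variable name `pattern` is only illustrative. Note that `findall` silently skips the stray `]` instead of reporting it as a separate piece.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"\[.+?\]|\(.+?\)|[\w+ ?]+")

# Mixed nesting: the lazy \(.+?\) stops at the first ")", which is the right one.
print(pattern.findall("a(b[c]d)e"))   # ['a', '(b[c]d)', 'e']

# Same-kind nesting: the lazy \[.+?\] stops at the first "]", which is too early.
print(pattern.findall("a[b[c]d]e"))   # ['a', '[b[c]', 'd', 'e']
```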

Is there a way to handle this with a regular expression without resorting to using code to count the number of open / closed parentheses? Thanks!

+4
2 answers

Standard regular expressions [1] are not powerful enough to match nested structures. The best approach is probably to traverse the string and keep track of the opening / closing pairs.
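A minimal sketch of that traversal in Python, assuming only `()` and `[]` count as delimiters and that unbalanced input should raise an error. The function name `split_top_level` is made up for illustration and is not from any library.

```python
def split_top_level(text):
    """Split text into top-level pieces: delimited groups and plain runs."""
    closers = {"(": ")", "[": "]"}
    pieces = []
    stack = []   # closing delimiters we still expect, innermost last
    start = 0    # index where the current piece begins

    for i, ch in enumerate(text):
        if ch in closers:
            # A group opening at top level ends any plain run before it.
            if not stack and i > start:
                pieces.append(text[start:i])
                start = i
            stack.append(closers[ch])
        elif ch in ")]":
            if not stack or stack.pop() != ch:
                raise ValueError(f"unbalanced {ch!r} at index {i}")
            # Back at depth zero: the outermost group is complete.
            if not stack:
                pieces.append(text[start:i + 1])
                start = i + 1

    if stack:
        raise ValueError("unclosed delimiter")
    if start < len(text):
        pieces.append(text[start:])
    return pieces


print(split_top_level("a(b[c]d)e"))   # ['a', '(b[c]d)', 'e']
print(split_top_level("a[b[c]d]e"))   # ['a', '[b[c]d]', 'e']
```

Unlike a regex with a fixed depth, the stack handles any nesting depth and also checks that each closer matches its opener.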


[1] I said standard, but not all regex engines are actually standard. You could do this in Perl, for example, by using recursive regular expressions. For instance:

 $str = "[hello [world]] abc [123] [xyz jkl]";

 my @matches = $str =~ /[^\[\]\s]+ | \[ (?: (?R) | [^\[\]]+ )+ \] /gx;

 foreach (@matches) {
     print "$_\n";
 }

Output:

 [hello [world]]
 abc
 [123]
 [xyz jkl]

EDIT: I see that you are using Python; check out pyparsing.

+9

Well, once you give up the idea that parsing nested expressions must work to unlimited depth, you can use plain regular expressions just fine by specifying a maximum depth in advance. Here's how:

 import re

 def nested_matcher(n):
     # Poor man's matched paren scanning, gives up after n+1 levels.
     # Matches any string with balanced parens or brackets inside; add
     # the outer parens yourself if needed. Nongreedy. Does not
     # distinguish parens and brackets as that would cause the
     # expression to grow exponentially rather than linearly in size.
     return "[^][()]*?(?:[([]"*n + "[^][()]*?" + "[])][^][()]*?)*?"*n

 # Either a run of non-delimiter characters, or one delimited group
 # with up to 10 further levels nested inside it.
 p = re.compile('[^][()]+|[([]' + nested_matcher(10) + '[])]')

 print(p.findall('a(b[c]d)e'))
 print(p.findall('a[b[c]d]e'))
 print(p.findall('[hello [world]] abc [123] [xyz jkl]'))

This will output:

 ['a', '(b[c]d)', 'e']
 ['a', '[b[c]d]', 'e']
 ['[hello [world]]', ' abc ', '[123]', ' ', '[xyz jkl]']
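To make the trade-offs named in the code comments concrete, here is a small sketch that probes the depth limit, the lack of paren/bracket distinction, and the linear growth of the pattern. It redefines `nested_matcher` so it runs on its own; `group_pattern` is a hypothetical helper introduced only for this illustration.

```python
import re

def nested_matcher(n):
    # Same construction as in the answer above.
    return "[^][()]*?(?:[([]" * n + "[^][()]*?" + "[])][^][()]*?)*?" * n

def group_pattern(n):
    # Hypothetical helper: one delimited group with at most n levels inside it.
    return re.compile("[([]" + nested_matcher(n) + "[])]")

# Three levels of nesting fit when n = 2 ...
print(group_pattern(2).fullmatch("[[[x]]]") is not None)   # True
# ... but are too deep when n = 1.
print(group_pattern(1).fullmatch("[[[x]]]") is not None)   # False

# Opening and closing kinds are not paired up, so "(a]" is accepted.
print(group_pattern(0).fullmatch("(a]") is not None)       # True

# Each extra level adds the same number of characters to the pattern.
lengths = [len(nested_matcher(n)) for n in range(4)]
print([b - a for a, b in zip(lengths, lengths[1:])])
```

If mismatched pairs such as `(a]` must be rejected, an explicit stack is probably the simpler route than growing the pattern.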
0

Source: https://habr.com/ru/post/1488893/