RegEx: text immediately after the last open parenthesis

Question


I know a little about RegEx, but this problem is well beyond my current abilities.

I need help finding the text / expression immediately after the last open parenthesis that does not have a matching closing parenthesis.

This is for CallTip open source software (Object Pascal) under development.

Below are some examples:

 ------------------------------------
 Text                   I need
 ------------------------------------
 aaa(xxx                xxx
 aaa(xxx,               xxx
 aaa(xxx, yyy           xxx
 aaa(y=bbb(xxx)         y=bbb(xxx)
 aaa(y <- bbb(xxx)      y <- bbb(xxx)
 aaa(bbb(ccc(xxx        xxx
 aaa(bbb(x), ccc(xxx    xxx
 aaa(bbb(x), ccc(x)     bbb(x)
 aaa(bbb(x), ccc(x),    bbb(x)
 aaa(?, bbb(??          ??
 aaa(bbb(x), ccc(x))    ''
 aaa(x)                 ''
 aaa(bbb(               ''
 ------------------------------------

For all of the text above, the RegEx proposed by @Bohemian

 (?<=\()(?=([^()]*\([^()]*\))*[^()]*$).*?(?=[ ,]|$)(?! <-)(?<! <-)

matches all cases. For the cases below (which I found while implementing the RegEx in the software) it does not:

 ------------------------------------
 New text               I need
 ------------------------------------
 aaa(bbb(x, y)          bbb(x, y)
 aaa(bbb(x, y, z)       bbb(x, y, z)
 ------------------------------------

Is it possible to write a RegEx (PCRE) for these situations?

In a previous post (RegEx: Word just before the last open parenthesis), Alan Moore (thanks a lot) helped me find the text just before the last opening parenthesis with the RegEx below:

 \w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)
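For reference, that pattern can be sanity-checked with a few lines of Python. This is only a sketch: Python is used here purely for illustration (the post targets Object Pascal), and the helper name `word_before_last_open_paren` is hypothetical.

```python
import re

# Alan Moore's pattern from the earlier post, unchanged.
WORD_BEFORE = re.compile(r"\w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)")

def word_before_last_open_paren(text):
    """Return the word right before the last unmatched '(' or None."""
    match = WORD_BEFORE.search(text)
    return match.group(0) if match else None

print(word_before_last_open_paren("aaa(bbb(x), ccc(xxx"))  # ccc
```

Like the pattern itself, this only tolerates one level of closed parentheses after the unmatched one.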

However, I could not adapt it to match the text right after that parenthesis.

Can anybody help?

+4

regex object-pascal

jcfaria Jun 12 '13 at 12:24


2 answers

This looks similar to this problem. And since you are using PCRE, which supports recursion syntax, there really is a solution.

 /
 (?(DEFINE)               # define a named capture for later convenience
   (?P<parenthesized>     # define the group "parenthesized" which matches a
                          # substring which contains correctly nested
                          # parentheses (it does not have to be enclosed in
                          # parentheses though)
     [^()]*               # match arbitrarily many non-parenthesis characters
     (?:                  # start non capturing group
       [(]                # match a literal opening (
       (?P>parenthesized) # recursively call this "parenthesized" subpattern
                          # ie make sure that the contents of these literal ()
                          # are also correctly parenthesized
       [)]                # match a literal closing )
       [^()]*             # match more non-parenthesis characters
     )*                   # repeat
   )                      # end of "parenthesized" pattern
 )                        # end of DEFINE sequence

 # Now the actual pattern begins

 (?<=[(])                 # ensure that there is a literal ( left of the start
                          # of the match
 (?P>parenthesized)?      # match correctly parenthesized substring
 $                        # ensure that we've reached the end of the input
 /x                       # activate free-spacing mode

The heart of this pattern is obviously the parenthesized subpattern, so I should elaborate on it a bit. Its structure is this:

 (normal* (?:special normal*)*)

Where normal is [^()] and special is [(](?P>parenthesized)[)]. This technique is called "unrolling-the-loop". It is used to match anything that has the structure

 nnnsnnsnnnnsnnsnn

Where n corresponds to normal and s corresponds to special.
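To make the normal* (special normal*)* shape concrete, here is a minimal sketch in Python without any recursion. It assumes special is just one flat group `\([^()]*\)`, so it only accepts one nesting level; the function name `is_flat_balanced` is made up for this example.

```python
import re

# normal  = [^()]        any non-parenthesis character
# special = \([^()]*\)   one flat parenthesized group (no recursion here)
UNROLLED = re.compile(r"[^()]*(?:\([^()]*\)[^()]*)*")

def is_flat_balanced(text):
    """True if the parentheses in text are balanced and at most one level deep."""
    return UNROLLED.fullmatch(text) is not None

print(is_flat_balanced("xxx(xxx)xxx(xxx)xxx"))  # True
print(is_flat_balanced("a(b(c))"))              # False: second nesting level
```

Because normal and special can never start with the same character, the engine has no ambiguous ways to split the input, which is what keeps the unrolled form fast.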

In this particular case, things are a bit more complicated, because we also use recursion. (?P>parenthesized) recursively uses the parenthesized pattern (which it is itself part of). The syntax (?P>...) is a bit like a backreference, except that the engine does not try to match what the group ... matched, but instead applies its subpattern again.

Also note that my pattern will not give you an empty string for correctly parenthesized input, but will fail. You can fix this by leaving out the lookbehind. The lookbehind is not really required, because the engine will always return the left-most match.

EDIT: Judging by your two examples, you don't actually want everything after the last unmatched parenthesis, but only the part up to the first comma. You can take my result and split it on , or try Bohemian's answer.

Further reading:

PCRE subpatterns (including named groups)
PCRE Recursion
"Unrolling-the-loop" was introduced by Jeffrey Friedl in his book, Mastering Regular Expressions, but I think the post I linked above gives a good overview.
Using (?(DEFINE)...) actually abuses another feature called conditional patterns. The PCRE man pages explain how this works; just search the page for "Defining subpatterns for use by reference only".

EDIT: I noticed that you mentioned in your question that you are using Object Pascal. In that case, you are probably not actually using PCRE, which means there is no support for recursion, and so there can be no full regex solution to the problem. If we impose a restriction such as "after the last unmatched parenthesis there can be only one more level of nesting" (as in all your examples), then we can come up with a solution. Again, I will use "unrolling-the-loop" to match substrings of the form xxx(xxx)xxx(xxx)xxx.

 (?<=[(])         # make sure we start after an opening (
 (?=              # lookahead checks that the parenthesis is not matched
   [^()]*([(][^()]*[)][^()]*)*
                  # this matches an arbitrarily long chain of parenthesized
                  # substring, but allows only one nesting level
   $              # make sure we can reach the end of the string like this
 )                # end of lookahead
 [^(),]*([(][^()]*[)][^(),]*)*
                  # now actually match the desired part. this is the same
                  # as the lookahead, except we do not allow for commas
                  # outside of parentheses now, so that you only get the
                  # first comma-separated part
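As a rough check of this non-recursive pattern, here is a sketch using Python's `re` module (chosen only for illustration; your Object Pascal regex library may behave differently). The capturing groups are written as non-capturing ones, which does not change what is matched, and the helper name `first_argument_text` is hypothetical.

```python
import re

ONE_LEVEL = re.compile(r"""
    (?<=[(])                          # start right after an opening (
    (?=                               # that ( must stay unmatched up to the end
        [^()]*(?:[(][^()]*[)][^()]*)*
        $
    )
    [^(),]*(?:[(][^()]*[)][^(),]*)*   # the first comma-separated part
""", re.VERBOSE)

def first_argument_text(text):
    """Return the part after the last unmatched '(' up to the first
    top-level comma, or None if the pattern does not match."""
    match = ONE_LEVEL.search(text)
    return match.group(0) if match else None

print(first_argument_text("aaa(bbb(x, y)"))        # bbb(x, y)
print(first_argument_text("aaa(bbb(x), ccc(xxx"))  # xxx
```

Note that a fully closed input such as aaa(x) yields no match at all rather than an empty string, so the caller has to map "no match" to '' if that is what is wanted.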

If you ever add an input example such as aaa(xxx(yyy()), where you want to match xxx(yyy()), then this approach will not match it. In fact, no regular expression that does not use recursion can handle arbitrary levels of nesting.

Since your regular expression flavor does not support recursion, you are probably better off not using a regular expression at all. Even though my last regular expression matches all your current input examples, it is really convoluted and maybe not worth the trouble. How about this instead: walk the string character by character and maintain a stack of parenthesis positions. Then the following pseudo code gives you everything after the last unmatched ( :

 while you can read another character from the string
     if that character is "(", push the current position onto the stack
     if that character is ")", pop a position from the stack
 # you've reached the end of the string now
 if the stack is empty, there is no match
 else the top of the stack is the position of the last unmatched parenthesis;
      take a substring from there to the end of the string
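The pseudo code above could be sketched like this in Python (used here only for illustration; the same loop ports directly to Object Pascal). The function name is made up, and ignoring a stray ")" when the stack is empty is an assumption the pseudo code does not spell out.

```python
def after_last_unmatched_paren(text):
    """Return the text after the last unmatched '(' or None if there is none."""
    stack = []  # positions of '(' that have not been closed yet
    for pos, ch in enumerate(text):
        if ch == "(":
            stack.append(pos)
        elif ch == ")" and stack:  # assumption: silently skip a stray ')'
            stack.pop()
    if not stack:
        return None  # every '(' was closed, so there is no match
    return text[stack[-1] + 1:]

print(after_last_unmatched_paren("aaa(bbb(x), ccc(xxx"))  # xxx
print(after_last_unmatched_paren("aaa(xxx(yyy())"))       # xxx(yyy())
```

Unlike the non-recursive regex, this handles any nesting depth, since the stack plays the role that recursion plays in the PCRE pattern.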

Then, to get everything up to the first comma that is not nested inside parentheses, you can walk through that result once more:

 nestingLevel = 0
 while you can read another character from the string
     if that character is "," and nestingLevel == 0, stop
     if that character is "(" increment nestingLevel
     if that character is ")" decrement nestingLevel
 take a substring from the beginning of the string to the position at which you left the loop
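A minimal Python sketch of this second loop, again only as an illustration with a hypothetical function name. It returns the whole input when no top-level comma is found.

```python
def up_to_first_top_level_comma(text):
    """Return text up to (not including) the first comma outside parentheses."""
    nesting_level = 0
    for pos, ch in enumerate(text):
        if ch == "," and nesting_level == 0:
            return text[:pos]  # stop at the first unnested comma
        if ch == "(":
            nesting_level += 1
        elif ch == ")":
            nesting_level -= 1
    return text  # no top-level comma: keep everything

print(up_to_first_top_level_comma("bbb(x, y), ccc(x)"))  # bbb(x, y)
print(up_to_first_top_level_comma("xxx, yyy"))           # xxx
```

Commas inside a nested call such as bbb(x, y) are skipped because the nesting level is above zero while they are read.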

These two short loops will be much easier for anyone else to understand in the future, and they are far more flexible than a regex solution (at least one without recursion).

+6

Martin ender Jun 12 '13 at 12:52


Bohemian · Accepted Answer · 2013-06-12T13:18:36+0000

Use a lookahead:

 (?<=\()(?=([^()]*\([^()]*\))*[^()]*$).*?(\(.*?\))?(?=[ ,]|$)(?! <-)(?<! <-)

See this working on rubular, passing all the test cases posted in the question.
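If rubular is not at hand, a few of the cases can also be checked locally. The following is a sketch in Python's `re` (an assumption made for illustration; the question targets Object Pascal), with a hypothetical helper name and only three of the question's cases.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(
    r"(?<=\()(?=([^()]*\([^()]*\))*[^()]*$).*?(\(.*?\))?(?=[ ,]|$)(?! <-)(?<! <-)"
)

def text_after_last_open_paren(text):
    """Return the matched text, or None if the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(0) if match else None

for case in ["aaa(xxx, yyy", "aaa(bbb(x, y)", "aaa(y <- bbb(xxx)"]:
    print(repr(case), "->", repr(text_after_last_open_paren(case)))
```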
