If-else in recursive regex not working properly

I use a regex to parse some BBCode, so the regex has to work recursively to also match tags nested inside other tags. Most BBCode tags take an argument, which is sometimes quoted, though not always.

The simplified equivalent of the regular expression I use (with html style tags to reduce the need for escaping):

    '~<(\")?a(?(1)\1)> #Match the tag, and require a closing quote if an opening one provided
      ([^<]+ | (?R))* #Match the contents of the tag, including recursively
    </a>~x'

However, if I have a test line that looks like this:

 <"a">Content<a>Also Content</a></a> 

it matches only <a>Also Content</a>. When the regex tries to match from the first tag, the first capture group, \1, is set to ", and this is not overwritten when the regex recurses to match the inner tag. Since the inner tag is not quoted, it does not match, and the whole attempt fails.

If I consistently either use or omit the quotes, it works fine, but I cannot be sure that this will be the case with the content I have to parse. Is there any way around this?


The full regex that I use to match [spoiler]content[/spoiler], [spoiler=option]content[/spoiler] and [spoiler="option"]content[/spoiler] is:

    "~\[spoiler\s*+ #Match the opening tag
        (?:=\s*+(\"|\')?((?(1)(?!\\1).|[^\]]){0,100})(?(1)\\1))?+\s*\] #If an option exists, match that
    (?:\ *(?:\n|<br />))?+ #Get rid of an extra new line before the start of the content if necessary
    ((?:[^\[\n]++ #Capture all characters until the closing tag
        |\n(?!\[spoiler]) #Capture new line separately so backtracking doesn't run away due to above
        |\[(?!/?spoiler(?:\s*=[^\]*])?) #Also match all tags that aren't spoilers
        |(?R))*+) #Allow the pattern to recurse - we also want to match spoilers inside spoilers,
                  # without messing up nesting
    \n? #Get rid of an extra new line before the closing tag if necessary
    \[/spoiler] #match the closing tag
    ~xi"

There are a few other bugs in it as well, though.

2 answers

The simplest solution is to use alternation instead:

 <(?:a|"a")> ([^<]++ | (?R))* </a> 

But if you really don't want to repeat the a part, you can do the following:

 <("?)a\1> ([^<]++ | (?R))* </a> 


I just moved the optional quantifier ? inside the group. This time the capture group always participates in the match, although what it captures may be empty, so the conditional is no longer needed.

Side note: I applied a possessive quantifier to [^<] to avoid catastrophic backtracking.
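As a quick check of the two variants above, here is a minimal PHP sketch. It assumes PHP's preg_match and the test line from the question; the variable names are only illustrative, and the patterns are written without the x flag, so the spaces are removed.

```php
<?php
// Test line taken from the question.
$subject = '<"a">Content<a>Also Content</a></a>';

// Variant 1: plain alternation, no backreference involved.
$alternation = '~<(?:a|"a")>([^<]++|(?R))*</a>~';

// Variant 2: the optional quote sits inside group 1, so the group is
// captured again (possibly as an empty string) at every recursion level.
$backref = '~<("?)a\1>([^<]++|(?R))*</a>~';

$results = [];
foreach ([$alternation, $backref] as $pattern) {
    preg_match($pattern, $subject, $m);
    $results[] = $m[0];   // full match
    echo $m[0], PHP_EOL;
}
```

Both patterns should report the whole test line as the match, outer quoted tag included.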


In your case, I think it's better to match a generic tag than one specific tag. Match all the tags, and then decide in your code what to do with each match.

Here's the full regex:

    \[
      (?<tag>\w+) \s*
      (?:=\s*
        (?:
          (?<quote>["']) (?<arg>.{0,100}?) \k<quote>
          | (?<arg>[^\]]+)
        )
      )?
    \]

    (?<content>
      (?:[^[]++ | (?R) )*+
    )

    \[/\k<tag>\]


Note that I added the J option (PCRE_DUPNAMES) so that (?<arg> ... ) can be used twice.
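To show how the generic pattern could be used from PHP, here is a hedged sketch. It assumes PHP 7.2 or later (where J is accepted as a pattern modifier), and the sample input is made up for illustration. Because the name arg is declared twice, the sketch reads the numbered groups instead of relying on how the named key gets filled in.

```php
<?php
// x = free-spacing mode, J = allow the duplicate group name "arg".
$pattern = <<<'EOD'
~
\[ (?<tag>\w+) \s*
(?:=\s*
  (?: (?<quote>["']) (?<arg>.{0,100}?) \k<quote>
    | (?<arg>[^\]]+)
  )
)?
\]
(?<content> (?:[^[]++ | (?R) )*+ )
\[/\k<tag>\]
~xJ
EOD;

// Hypothetical input: nested tags, a quoted argument, an unquoted argument.
$subject = '[b]bold [i]nested[/i][/b] [spoiler="a b"]secret[/spoiler] '
         . '[url=http://example.com]link[/url]';

preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

$tags = [];
foreach ($matches as $m) {
    // Groups by number: 1 = tag, 3 = quoted arg, 4 = unquoted arg, 5 = content.
    $arg = $m[3] !== '' ? $m[3] : $m[4];
    $tags[] = ['tag' => $m[1], 'arg' => $arg, 'content' => $m[5]];
}
print_r($tags);
```

Only top-level tags are returned here; the nested [i] tag stays inside the content of [b], so the calling code can run the same pattern again on the content if it needs the inner tags.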


(?(1)...) only checks if group 1 has been defined, so the condition is true after the group is defined for the first time. That is why you get this result (it is not related to the recursion level or anything else).

So, when <a> is reached in the recursion, the regex engine tries to match <a"> and fails.
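The behaviour can be reproduced with a few lines of PHP. This is only a sketch, assuming preg_match and the test line from the question, with the pattern written without the x flag and its comments:

```php
<?php
$subject = '<"a">Content<a>Also Content</a></a>';

// The question's pattern. For the inner <a>, (")? matches nothing, but
// group 1 still holds the quote captured by the outer tag, so the
// conditional (?(1)\1) demands a closing quote that is not there.
$original = '~<(")?a(?(1)\1)>([^<]+|(?R))*</a>~';

preg_match($original, $subject, $m);
$originalMatch = $m[0];

echo $originalMatch, PHP_EOL;   // <a>Also Content</a>
```

The attempt starting at the outer tag fails, so the first successful match starts at the inner tag, where group 1 has never been set.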

If you want to keep the conditional, you can write <("?)a(?(1)\1)>. That way, group 1 is redefined every time.

Obviously, you can write your pattern in a more efficient way, for example:

 ~<(?:a|"a")>[^<]*+(?:(?R)[^<]*)*+</a>~ 
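Both suggestions can be checked against the question's test line. The sketch below assumes PHP's preg_match; the variable names are only illustrative.

```php
<?php
$subject = '<"a">Content<a>Also Content</a></a>';

// Conditional kept, but the ? is moved inside group 1, so the group is
// set again (to '"' or to an empty string) on every tag, at every level.
$conditional = '~<("?)a(?(1)\1)>([^<]+|(?R))*</a>~';

// Unrolled version: a possessive run of non-'<' characters, then
// repeated (nested tag, non-'<' run) pairs.
$unrolled = '~<(?:a|"a")>[^<]*+(?:(?R)[^<]*)*+</a>~';

preg_match($conditional, $subject, $m1);
preg_match($unrolled, $subject, $m2);

var_dump($m1[0] === $subject, $m2[0] === $subject);
```

Each pattern should match the entire line, quoted outer tag and unquoted inner tag included.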

For your specific problem, I would use this kind of pattern to match any tag:

    $pattern = <<<'EOD'
    ~
    \[ (?<tag>\w+) \s*
    (?:
      = \s*
      (?| " (?<option>[^"]*) " | ' ([^']*) ' | ([^]\s]*) ) # branch reset feature
    )?
    \s* ]
    (?<content> [^[]*+ (?: (?R) [^[]*)*+ )
    \[/\g{tag}]
    ~xi
    EOD;

If you want to impose a specific tag at the top level, you can add (?(R)|(?=spoiler\b)) before the tag name.
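Here is a sketch of the pattern with that condition added, run over a made-up input. It assumes PHP's preg_match_all, and the sample string is hypothetical. Outside of any recursion the lookahead requires the tag name to be spoiler; inside a recursion the condition (R) is true and any tag name is accepted.

```php
<?php
$pattern = <<<'EOD'
~
\[ (?(R)|(?=spoiler\b)) (?<tag>\w+) \s*
(?:
  = \s*
  (?| " (?<option>[^"]*) " | ' ([^']*) ' | ([^]\s]*) ) # branch reset feature
)?
\s* ]
(?<content> [^[]*+ (?: (?R) [^[]*)*+ )
\[/\g{tag}]
~xi
EOD;

// Hypothetical input: a top-level [b] tag and a spoiler containing a [b] tag.
$subject = '[b]x[/b] [spoiler=hint]hidden [b]bold[/b][/spoiler]';

preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m['tag'], ' | ', $m['option'], ' | ', $m['content'], PHP_EOL;
}
```

With this input, the top-level [b]x[/b] is skipped and only the spoiler is reported, with its nested [b] tag left inside the content.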


Source: https://habr.com/ru/post/989856/

