Match double hyphens in invalid XML comments

I have to parse XML files that violate the “no double hyphens in comments” rule, which makes MSXML complain. I'm looking for a way to remove the offending hyphens.

I am using StringRegExpReplace(). I tried these regular expressions:

 <!--(.*)--> : correctly gets comments
 <!--(-*)--> : fails to be a correct regex (also tried escaping and using \x2D)

Given the correct pattern, I would call:

 StringRegExpReplace($xml_string,$correct_pattern,"") ;replace with nothing 

How do I match the extra hyphens inside an XML comment, so that only the remaining text is left?

2 answers

You can use this pattern:

 (?|\G(?!\A)(?|-{2,}+([^->][^-]*)|(-[^-]+)|-+(?=-->)|-->[^<]*(*SKIP)(*FAIL))|[^<]*<+(?>[^<]+<+)*?(?:!--\K|[^<]*\z\K(*ACCEPT))(?|-*+([^->][^-]*)|-+(?=-->)|-?+([^-]+)|-->[^<]*(*SKIP)(*FAIL)())) 

Details:

 (?|
     \G(?!\A)               # contiguous to the precedent match (inside a comment)
     (?|
         -{2,}+([^->][^-]*) # duplicate hyphens, not part of the closing sequence
       |
         (-[^-]+)           # preserve isolated hyphens
       |
         -+ (?=-->)         # hyphens before closing sequence, break contiguity
       |
         -->[^<]*           # closing sequence, go to next <
         (*SKIP)(*FAIL)     # break contiguity
     )
   |
     [^<]*<+                # reach the next < (outside comment)
     (?> [^<]+ <+ )*?       # next < until !-- or the end of the string
     (?: !-- \K | [^<]*\z\K (*ACCEPT) ) # new comment or end of the string
     (?|
         -*+ ([^->][^-]*)   # possible hyphens not followed by >
       |
         -+ (?=-->)         # hyphens before closing sequence, break contiguity
       |
         -?+ ([^-]+)        # one hyphen followed by >
       |
         -->[^<]*           # closing sequence, go to next <
         (*SKIP)(*FAIL) ()  # break contiguity (note: "()" avoids a mysterious bug
     )                      # in regex101, you can remove it)
 )

With this replacement: \1


The \G anchor guarantees that matches are contiguous. Two methods are used to break contiguity:

  • the lookahead (?=-->)
  • the backtracking control verbs (*SKIP)(*FAIL), which force the pattern to fail and prevent the characters matched so far from being retried.

So, when contiguity is broken, or at the start of the string, the first main branch fails (because of the \G(?!\A) anchor) and the second branch is used.

\K discards everything matched so far from the reported match result.

(*ACCEPT) makes the pattern succeed immediately and unconditionally.

This pattern makes heavy use of the branch reset feature (?|...(..)...|...(..)...|...), so all capture groups have the same number (in other words, there is only one group, group 1).

Note: even though this pattern is long, it needs only a few steps to find a match. The impact of non-greedy quantifiers is reduced as much as possible, and the alternatives are ordered and written to be as efficient as possible. One goal is to reduce the total number of matches needed to process a string.

 (?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->) 

matches -- (or ---, ----, etc.) only between <!-- and -->. You need to set the /s option so that the dot also matches newlines.

Explanation:

 (?<!<!)     # Assert that we're not right at the start of a comment
 --+         # Match two or more dashes --
 (?=         # only if the following can be matched further onwards:
   (?!-?>)   # First, make sure we're not at the end of the comment.
   (?:       # Then match the following group
     (?!-->) # which must not contain -->
     .       # but may contain any character
   )*        # any number of times
   -->       # as long as --> follows.
 )           # End of lookahead assertion.

Test it live at regex101.com.

I assume the correct AutoIt syntax will be

 StringRegExpReplace($xml_string, "(?s)(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", "") 
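
To check how this second pattern behaves before wiring it into AutoIt, here is a minimal sketch in Python (not AutoIt), whose standard re module accepts this pattern unchanged. The names DOUBLE_HYPHENS and strip_comment_hyphens are hypothetical helpers introduced only for this illustration, and re.DOTALL is assumed to play the role of the /s option mentioned above.

```python
import re

# Pattern copied from the answer above. re.DOTALL stands in for the /s option,
# so that "." in the lookahead can also cross line breaks.
DOUBLE_HYPHENS = re.compile(r"(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", re.DOTALL)


def strip_comment_hyphens(xml_text: str) -> str:
    """Delete every run of two or more hyphens that the pattern matches."""
    return DOUBLE_HYPHENS.sub("", xml_text)


sample = "<a><!-- foo -- bar ---- baz --><b>x</b></a>"
cleaned = strip_comment_hyphens(sample)
print(cleaned)  # <a><!-- foo  bar  baz --><b>x</b></a>

# A comment spanning two lines is only cleaned when the dot matches newlines.
multiline = "<!-- one --\ntwo -->"
print(repr(strip_comment_hyphens(multiline)))  # '<!-- one \ntwo -->'
```

One caveat this sketch makes visible: the lookahead only checks that some --> follows later in the string, not that the hyphens sit inside a comment. With this pattern, a -- in ordinary text that precedes a later comment (for example "<b>x--y</b><!-- c -->") is removed as well, so it is worth testing against representative input first.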

Source: https://habr.com/ru/post/981476/

