Regex: match string to pattern (which may not exist)

I am trying to parse an XML document using the regex tokenizer in Python (this is a finite set, so the regex is just fine!) And I am having problems matching comments.

The format of these comments is in the form <!--This is a comment--> , where the comment itself can contain all kinds of non-alphanumeric characters (including '-')

I want to match them in such a way as to break the comment into the following tokens:

<!--

This is a comment

-->

The start marker is easy enough to get, and I successfully grabbed it with another regular expression, but the comment expression itself is too greedy and grabs -- from the end marker. I want this regular expression to capture lines that are also not necessarily included in the comment, so it should also be able to accept <Tag>This is text</Tag> and correctly return This is text .

This is the regular expression that I am currently using for text:

[^<>]+(?!-->)

The end result ends with This is a comment-- when I just want This is a comment so that my other regular expression can capture --> . This regex does work for regular tags, however, because of the existence of a '<' at the end of the tag, it correctly returns This is text from my previous example.

I know that I should not use a negative lookahead correctly. Any ideas on what I'm doing wrong here? I tried [^<>]+(?=-->) , but then this does not correspond to what is not a comment of this form (for example, normal tags). I realized that (?!-->) will stop the match when he sees this pattern, but it does not seem to work like that, but just continues the match until he sees the ending ">".

Posting a code segment for context:

 xml_scanner = re.Scanner([ (r" ", lambda scanner,token:("INDENT", token)), (r"<[A-Za-z\d._]+(?!\/)>", lambda scanner,token:("BEGINTAG", token)), (r"<\/[A-Za-z\d._]+(?!\/)>", lambda scanner,token:("ENDTAG", token)), (r"<[A-Za-z\d._]+\/>", lambda scanner,token:("INLINETAG", token)), (r"<!--", lambda scanner,token:("BEGINCOMMENT", token)), (r"-->", lambda scanner,token:("ENDCOMMENT", token)), (r"[^<>]+(?!-->)", lambda scanner,token:("DATA", token)), (r"\r$", None), ]) for line in database_file: results, remainder = xml_scanner.scan(line) 

This is the only thing the script is doing at the moment.

+4
source share
2 answers

Your problem is that in [^<>]+(?!-->) the lookahead statement is checked only after This is a comment-- been matched, and, of course, it succeeds because in the position, where the regex mechanism appeared, no --> .

Therefore, you must check the lookahead at each position of the line. A custom idiom for this is usually:

 (?:(?!-->).)* 

or, in your case

 (?:(?!-->)[^<>])* 

This matches any number of characters except the angle brackets until either --> occurs or the end of the line is reached.

+5
source

You need to use positive viewing to make sure this template is ahead of your match.

 [^<>]+(?=-->) 

If you want to make it more universal so that it matches the contents of other tags, you can use alternation inside the lookahead statement:

 [^<>]+(?=-->|</) 

See here at Regexr

This regular expression will stop if there is a " --> " or " </ " in front of it.

+5
source

Source: https://habr.com/ru/post/1486027/


All Articles