I am trying to parse an XML document using the regex tokenizer in Python (the input is a limited set, so regex is just fine!), and I am having trouble matching comments.
These comments have the form <!--This is a comment-->, where the comment itself can contain all kinds of non-alphanumeric characters (including '-').
I want to match them in such a way as to break the comment into the following tokens:
<!--
This is a comment
-->
The start marker is easy enough, and I capture it successfully with a separate regular expression, but the expression for the comment text is too greedy and grabs the -- from the end marker. I want this same expression to also capture text that is not inside a comment, so it should also accept <Tag>This is text</Tag> and correctly return This is text.
This is the regular expression that I am currently using for text:
[^<>]+(?!-->)
The result ends up being This is a comment-- when I only want This is a comment, so that my other regular expression can capture -->. This regex does work for regular tags, however: because of the '<' at the start of the closing tag, it correctly returns This is text from my previous example.
I figure I am not using the negative lookahead correctly. Any ideas on what I am doing wrong here? I tried [^<>]+(?=-->), but then it no longer matches anything that is not a comment of this form (for example, normal tags). I assumed that (?!-->) would stop the match when it sees that pattern, but it does not seem to work like that; it just continues matching until it sees the closing '>'.
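To make the behaviour concrete, here is a small reproduction that calls re.match directly instead of going through the scanner. The two sample strings are the examples from above; the names current and positive are just labels for this sketch.

```python
import re

comment_body = "This is a comment-->"
tag_body = "This is text</Tag>"

# The DATA pattern I am using now (negative lookahead)
current = re.compile(r"[^<>]+(?!-->)")
# The variant I tried (positive lookahead)
positive = re.compile(r"[^<>]+(?=-->)")

m = current.match(comment_body)
print(repr(m.group()))               # 'This is a comment--'
# The lookahead is only tested at the point where [^<>]+ stopped,
# and the text remaining there is just '>', which is not '-->'.
print(repr(comment_body[m.end():]))  # '>'

print(repr(current.match(tag_body).group()))       # 'This is text'

# The positive lookahead backtracks to the right spot in a comment...
print(repr(positive.match(comment_body).group()))  # 'This is a comment'
# ...but never matches text that is followed by a normal closing tag.
print(positive.match(tag_body))                    # None
```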
Posting a code segment for context:
xml_scanner = re.Scanner([
    (r" ", lambda scanner, token: ("INDENT", token)),
    (r"<[A-Za-z\d._]+(?!\/)>", lambda scanner, token: ("BEGINTAG", token)),
    (r"<\/[A-Za-z\d._]+(?!\/)>", lambda scanner, token: ("ENDTAG", token)),
    (r"<[A-Za-z\d._]+\/>", lambda scanner, token: ("INLINETAG", token)),
    (r"<!--", lambda scanner, token: ("BEGINCOMMENT", token)),
    (r"-->", lambda scanner, token: ("ENDCOMMENT", token)),
    (r"[^<>]+(?!-->)", lambda scanner, token: ("DATA", token)),
    (r"\r$", None),
])

for line in database_file:
    results, remainder = xml_scanner.scan(line)
This is the only thing the script is doing at the moment.
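For comparison, here is a cut-down sketch of the scanner with a possible alternative DATA pattern that tests the negative lookahead before every character rather than once at the end. DATA_PATTERN, sketch_scanner and tokenize are names made up for this sketch, only the tag and comment rules are kept, and I have only tried it on the sample lines shown, not on the real file.

```python
import re

# Assumed alternative: consume one character at a time, and only while
# the text at the current position does not start with '-->'.
DATA_PATTERN = r"(?:(?!-->)[^<>])+"

sketch_scanner = re.Scanner([
    (r"<[A-Za-z\d._]+(?!\/)>", lambda scanner, token: ("BEGINTAG", token)),
    (r"<\/[A-Za-z\d._]+(?!\/)>", lambda scanner, token: ("ENDTAG", token)),
    (r"<!--", lambda scanner, token: ("BEGINCOMMENT", token)),
    (r"-->", lambda scanner, token: ("ENDCOMMENT", token)),
    (DATA_PATTERN, lambda scanner, token: ("DATA", token)),
])


def tokenize(line):
    """Return (tokens, unmatched remainder) for a single line."""
    return sketch_scanner.scan(line)


for sample in ("<!--This is a comment-->",
               "<Tag>This is text</Tag>",
               "<!--a-b -- c-->"):
    tokens, remainder = tokenize(sample)
    print(tokens, repr(remainder))
```

With this pattern the comment text comes back without the trailing --, and hyphens inside the comment are still kept as part of DATA.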