Regular expressions are stateless. To parse an XML file, you need state. A < may signal the opening of an XML element. If it appears inside a comment ( <!-- < --> ) or inside an attribute value ( "<" ), however, it means something else. With regexes, you can only express things in terms of what comes before or after other things. To correctly recognize a < as opening an XML element, you would have to express something along these lines:
< , but not after <!-- if that <!-- is not yet followed by --> , and not after " if that " is not yet closed, but only if the " belongs to an attribute, because a " in text content does not affect the next < , and if not...
And that is just for a single < , without even covering every possibility. Several other special XML characters come with the same kind of circular conditions. Building a regular expression that correctly expresses all of these conditions for all cases is nearly impossible. With a state machine, it is trivial.
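To make the contrast concrete, here is a minimal sketch of such a state machine in Python. It is an illustration only, not a real XML parser: the names State and tag_start_positions are made up for this example, and it knows about just four states (text, tag, attribute value, comment). CDATA sections, processing instructions and DOCTYPE declarations are ignored, and it is deliberately lenient about a raw < inside an attribute value, which strict XML would reject.

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()        # between tags
    TAG = auto()         # inside <...>, outside any attribute value
    ATTR_VALUE = auto()  # inside a quoted attribute value
    COMMENT = auto()     # inside <!-- ... -->


def tag_start_positions(doc: str) -> list[int]:
    """Return the index of every '<' that starts a tag (opening or closing)."""
    positions = []
    state = State.TEXT
    quote = ""  # the quote character that opened the current attribute value
    i = 0
    while i < len(doc):
        ch = doc[i]
        if state is State.TEXT:
            if doc.startswith("<!--", i):
                # A comment starts here; its '<' is not a tag.
                state = State.COMMENT
                i += 4
                continue
            if ch == "<":
                positions.append(i)
                state = State.TAG
            # A quote in text content is ignored: it changes nothing.
        elif state is State.TAG:
            if ch in "\"'":
                quote = ch
                state = State.ATTR_VALUE
            elif ch == ">":
                state = State.TEXT
        elif state is State.ATTR_VALUE:
            # Any '<' in here is plain data; only the matching quote matters.
            if ch == quote:
                state = State.TAG
        elif state is State.COMMENT:
            if doc.startswith("-->", i):
                state = State.TEXT
                i += 3
                continue
        i += 1
    return positions


print(tag_start_positions('<a b="<"><!-- < -->"<c/></a>'))
```

For that input the function returns [0, 20, 24]: the < inside the attribute value and the one inside the comment are skipped, and the stray " in the text does not hide the <c/> that follows it. Each of the "but not after ..." conditions from above becomes a single branch on the current state.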