Why can't a regular expression match an XML element?

This article claims that regular expressions cannot match nested structures because regular expressions are finite state machines.

It then lists problems that it says cannot be solved with regular expressions:

  • matching an XML element
  • matching a C / VB / C# math expression
  • matching a valid regular expression

Items 2 and 3 can contain nested brackets, and regular expressions cannot handle that kind of nesting. But why is it impossible to match an XML element? (The author gave no examples.)

+6
4 answers

You can match a limited subset of HTML tags if you know in advance which tags you want to match.

But you cannot (reliably or elegantly) parse arbitrary HTML. HTML is not a regular language.
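
To make the distinction concrete, here is a minimal sketch using Python's standard `re` module. The tag name, the `BOLD` pattern and the sample strings are made up for illustration. It matches a known tag as long as that tag is not nested, and pairs the wrong tags as soon as it is.

```python
import re

# Non-greedy pattern for one known tag. It assumes <b> never nests.
BOLD = re.compile(r"<b>(.*?)</b>", re.DOTALL)

flat = "<p><b>one</b> and <b>two</b></p>"
nested = "<b>outer <b>inner</b> tail</b>"

# Flat input: each <b>...</b> pair is found correctly.
flat_matches = BOLD.findall(flat)

# Nested input: the first <b> is paired with the first </b>,
# so the captured text is cut off in the middle of the outer element.
nested_matches = BOLD.findall(nested)

print(flat_matches)
print(nested_matches)
```

Making the quantifier greedy only moves the problem: it would then swallow everything between the first `<b>` and the last `</b>` in the input.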

+3

How would you match this valid XML with a regular expression?

<!--<d>>--<<--><div class='foo' id="bar" inline></div> 

It is like building a car out of wood. Of course you can try, but why?

And then comes the actual XML parsing. How would you extract a potentially unbounded number of attributes from an unbounded number of elements using a finite set of capture groups? It is simply not possible, given the nature and structure of regular expressions.

+1

There are theoretical answers, based on what kind of grammar XML has and what kind of grammar regular expressions can match. These answers are sometimes wrong, because most of the regular expression libraries we use today can do things that the formal regular expressions defined in computer science cannot (for example, backreferences).
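
As an illustration of that gap, here is a small sketch with Python's `re` module (the `PAIRED` pattern and the sample strings are hypothetical). The backreference `\1` forces the closing tag name to equal the opening one, which a formal regular expression cannot do for arbitrary names. It still does not handle nesting of the same tag.

```python
import re

# \1 must repeat whatever group 1 captured, i.e. the opening tag name.
PAIRED = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

# <a> has no matching </a>, so the first element found is <c>ok</c>.
m = PAIRED.search("<a>text</b> <c>ok</c>")
print(m.group(1), m.group(2))

# Same-name nesting still goes wrong: the outer <a> is paired
# with the inner </a>.
nested = PAIRED.search("<a><a>x</a></a>")
print(nested.group(0))
```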

And there are practical answers. The practical answer: don't do it, because it is the wrong tool for the job. Your code will be hard to write and hard to maintain, it will be inefficient, it will have bugs, and nobody will know how to change it when the structure of the document changes. And because there are better tools for this job, called XML parsers.
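
For comparison, a minimal sketch with the standard-library parser `xml.etree.ElementTree`. The sample document and the variable names are invented for illustration. Nesting, comments, both quote styles and entity references are handled by the parser rather than by the pattern author.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a comment containing "<", nested <div> elements,
# single- and double-quoted attributes, and an escaped "<" in text.
doc = """<root>
  <!-- a comment containing <div> -->
  <div class='foo' id="bar"><div id="inner">text &lt; more</div></div>
</root>"""

root = ET.fromstring(doc)

# Collect all attributes of every <div>, however deeply nested.
attrs_by_id = {el.get("id"): dict(el.attrib) for el in root.iter("div")}

# Entity references are already decoded in the text content.
inner_text = root.find("./div/div").text

print(attrs_by_id)
print(inner_text)
```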

+1

Regular expressions are stateless. To parse an XML file, you need state. A < may signal the opening of an XML element. If it appears inside a comment <!-- < --> or inside an attribute value "<", however, it means something else. With regexes, you can only express things in terms of what comes before or after other things. To correctly recognize a < as opening an XML element, you would need to express something along these lines:

< , but not after <!-- if that <!-- is not followed by --> , and not after " if that " is not closed, but only if the " belongs to an attribute, because a " in text content does not affect the next < , and unless...

And this is only for a simple < , without even covering all the possibilities. Several other special XML characters come with similarly convoluted conditions. Building a regular expression that correctly expresses all these conditions for all cases is almost impossible. With a state machine it is trivial.
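
A rough sketch of what such a state machine could look like in Python. The function name `count_open_tags` and the four state names are invented here, and the scanner deliberately ignores CDATA sections, processing instructions and DOCTYPE declarations, so it is an illustration rather than a real XML tokenizer.

```python
def count_open_tags(text: str) -> int:
    """Count '<' characters that open an element start tag."""
    state = "text"   # one of: text, comment, tag, quoted
    quote = ""       # the quote character that opened the current value
    count = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if state == "text":
            if text.startswith("<!--", i):
                state = "comment"
                i += 4
                continue
            if ch == "<":
                # "</name>" closes an element, so it is not counted.
                if not text.startswith("</", i):
                    count += 1
                state = "tag"
        elif state == "comment":
            # Everything is ignored until the comment terminator.
            if text.startswith("-->", i):
                state = "text"
                i += 3
                continue
        elif state == "tag":
            if ch in "\"'":
                state, quote = "quoted", ch
            elif ch == ">":
                state = "text"
        elif state == "quoted":
            # Inside an attribute value only the matching quote matters.
            if ch == quote:
                state = "tag"
        i += 1
    return count


sample = '<!-- < --><a title="<"><b/></a> "quoted" text'
print(count_open_tags(sample))
```

Each condition from the sentence above becomes one state: the `<` inside the comment and the `<` inside the attribute value are skipped because the scanner is not in the `text` state when it sees them, and the quotes in plain text are ignored because they only matter inside a tag.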

0

Source: https://habr.com/ru/post/889920/