Regular expressions vs. XPath when parsing HTML text

Question

I want to parse HTML text and find specific parts, for example the text in the 3rd div of the 1st row and 2nd column of a table. I have two options for parsing: regular expressions and XPath. What are the advantages and disadvantages of each?

thanks

+4

html regex parsing html-parsing xpath

Afshar mohebbi Aug 25 '11 at 20:41

4 answers

To some extent, it depends on whether you have an entire HTML file of unknown but well-formed content, or merely a snippet of HTML of completely known content, which may or may not be well-formed.

There is a difference between editing and parsing.

It's one thing to edit your own HTML file, one that you wrote yourself or can otherwise eyeball directly, by issuing an editor command such as

 :100,200s!<br */>!!g

to remove the line breaks from lines 100-200.
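For readers who do not use vi, here is a rough Python sketch of the same kind of edit. The function name `strip_breaks` and the sample lines are made up for illustration; the pattern is the one from the editor command above.

```python
import re

# Same pattern as in the editor command: "<br", optional spaces, "/>".
BR_TAG = re.compile(r"<br */>")

def strip_breaks(lines, first, last):
    """Return a copy of lines with <br/> tags removed on 1-based lines first..last."""
    result = []
    for number, line in enumerate(lines, start=1):
        if first <= number <= last:
            line = BR_TAG.sub("", line)
        result.append(line)
    return result

# Hypothetical three-line file; only lines 1-2 are edited.
sample = ["<p>one<br/></p>", "<p>two<br />three<br/></p>", "<p>four<br/></p>"]
edited = strip_breaks(sample, 1, 2)
print(edited)
```

This is an edit on text you already know, not a parse: the pattern makes no attempt to understand the document around the tags it removes.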

It is a completely different matter to suck down whatever HTML happens to sit at the other end of a URL and then try to make sense of it, sight unseen.

The first calls for a regular expression solution, like the one shown above. Refusing to use one, and instead writing some massively overengineered behemoth that sets up an entire parse tree just to make the simple edit shown above, is quite simply wrong. It is also its own punishment.

On the other hand, using patterns to parse (as opposed to lex) an entire HTML document, which can contain all kinds of wacky things you never planned for, just screams for using someone else's hard work instead of reinventing the wheel yourself, and badly at that.

However, there is something else no one likes to mention: most people are simply not competent at regexes. They really do not understand them. They do not know how to test them or create them. They do not know how to make them readable and maintainable.

The truth is that the vast majority of regular expression users cannot manage something as plain and simple as matching an arbitrary HTML tag with a regular expression, even when things like alternate encodings, CDATA sections, redefined entities, <script> contents, and archaic never-seen forms are all safely set aside.

It's not because it's hard to do; in fact, it isn't. It's just that the people attempting it understand neither regular expressions nor HTML very well, and they don't know what they don't know, so they get in over their heads faster than they realize. And then they have a complete mess on their hands.

Plus it has been done before, and done right. You might as well learn from someone else's mistakes for a change, eh? It would probably help to have some canned regular expressions at your disposal for the things you manipulate often. This is especially useful for editing.

But for full parsing, you really should not try to embed the full HTML grammar in your pattern. Honestly, you really shouldn't. Speaking as someone who really has done it, I, unlike 99.9999% of respondents, have earned credibility from real experience in this area when I advise against it. Of course I can do this, but I almost never want to, and I certainly do not want you to try it at home unattended. I cannot be held responsible for any damage that may occur. :)

Of course, this may sound like "Do as I say, not as I do," but if your regular expression skills were at a level that allowed you to contemplate such a thing, you would not be asking this question. As I mentioned, almost no one who uses regular expressions can actually match an arbitrary HTML tag, just like that. Given that you need such a building block before writing a recursive descent grammar, and given that almost no one can manage even this simple building block, well ...

Given this sad state of affairs, it might be best to use regular expressions for simple editing tasks and leave their use in more complete solutions to real regular expression wizards, for they are subtle and quick to anger. I mean the regular expressions, of course, not the wizards.

But do believe that some canned regular expressions are convenient for easy editing, not full parsing. That way you will not be forced to rebuild them from first principles every time. I keep some of them around, but I also keep simple frameworks that let me edit a specific HTML structural element, for example plain text, the contents of a tag, or a link. Those all use a full parser, which lets me surgically target only the parts I want, with complete confidence that I have not missed anything.
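As an illustration of that "let a full parser find the element" idea, here is a minimal sketch using Python's standard `html.parser` module. It is not the framework described above; the class name `LinkCollector`, the helper `collect_links`, and the sample markup are all invented for the example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href values of <a> start tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def collect_links(markup):
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links

# Hypothetical input with mixed case, mixed quoting, a script and a comment.
sample = """<p>See <a href='/one'>one</a> and <A HREF="/two" class=x>two</A>.
<script>var s = "<a href='/not-a-link'>";</script>
<!-- <a href="/commented-out">old</a> -->"""
links = collect_links(sample)
print(links)
```

The parser treats the script body and the comment as something other than tags, so only the two real links are collected. A pattern that just looks for `<a href=` would have to handle both of those cases by hand.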

Moreover, as evidence of what is possible rather than what is desirable, you can see some answers with more "heroic" pattern matching, including recursion, here, here, here, here, here, and here.

Understand that some of them were actually written to show people why they should not use regular expressions, because some of them are really quite complex, far more sophisticated than you would expect from non-wizards. That difficulty may scare you off, which is fine, because that was the intent.

But do not let this stop you from using vi on your HTML files, and do not let it scare you away from vi's search or substitute commands. Do not let the perfect be the enemy of the good. Sometimes good enough is exactly what you need, because the perfect would take more investment than it is worth.

Understanding which of several possible approaches will give you the best bang for your buck is something that takes time to learn, and no one can tell you the answer that works for you. They do not know your data set, your requirements, your skill set, or your priorities. Therefore, any categorical answer is automatically wrong. You must evaluate these things yourself.

+7

tchrist Aug 26 '11 at 0:29

XPath is less likely to break if a web developer makes any minor changes. That would be my choice.
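A small sketch of what that looks like in practice, using only Python's standard library. The markup, the `price` class, and the function names are hypothetical, and the fragment is assumed to be well-formed so that `xml.etree.ElementTree` can read it.

```python
import re
import xml.etree.ElementTree as ET

# Two versions of the same hypothetical fragment. In the second one the
# attributes are reordered, single-quoted, and split across two lines.
before = '<div><span class="price" id="p1">42</span></div>'
after = "<div><span id='p1'\n      class='price'>42</span></div>"

# A naive pattern tied to the exact spelling of the tag.
NAIVE = re.compile(r'<span class="price" id="p1">(.*?)</span>')

def price_by_regex(markup):
    match = NAIVE.search(markup)
    return match.group(1) if match else None

def price_by_xpath(markup):
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    root = ET.fromstring(markup)
    node = root.find(".//span[@class='price']")
    return node.text if node is not None else None

print(price_by_regex(before), price_by_regex(after))
print(price_by_xpath(before), price_by_xpath(after))
```

The naive pattern stops matching after the cosmetic change, while the path expression still finds the element. A more careful pattern could cope with this particular change, but every such variation has to be anticipated by hand.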

+3

Ed heal Aug 25 '11 at 20:45

Here's the canonical Stack Overflow explanation of why you shouldn't parse HTML with regular expressions:

RegEx match open tags except XHTML self-contained tags

In general, you cannot parse HTML with regular expressions, because regular expressions are not meant for parsing HTML. Just use XPath.

+2

Jared ng Aug 25 '11 at 20:47

Raphael · Accepted Answer · 2011-08-25T20:48:29+0000

I think XPath is the primary option for traversing XML-like documents. With RegExp you have to handle the many ways a tag can be written yourself (with multiple spaces, double quotes, single quotes, no quotes, on one line, across multiple lines, with inner data, without inner data, etc.). With XPath, all of this is transparent to you, and it has many features (for example, accessing a node by index, selecting by attribute values, selecting siblings, and MUCH more).
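For the case in the question (the 3rd div in the 1st row, 2nd column of a table), a path expression could look roughly like this. This is only a sketch using Python's standard `xml.etree.ElementTree`, which supports a subset of XPath and needs well-formed input. The sample page and the function name `third_div_text` are made up, and real-world HTML would usually have to go through an HTML-aware parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
  <table>
    <tr>
      <td><div>r1c1</div></td>
      <td><div>first</div><div>second</div><div>third</div></td>
    </tr>
    <tr>
      <td><div>r2c1</div></td>
      <td><div>a</div><div>b</div><div>c</div></td>
    </tr>
  </table>
</body></html>"""

def third_div_text(markup):
    """Text of the 3rd div in row 1, column 2 of the table, or None."""
    root = ET.fromstring(markup)
    # Positions in XPath are 1-based: tr[1] is the first row.
    node = root.find(".//table/tr[1]/td[2]/div[3]")
    return node.text if node is not None else None

result = third_div_text(PAGE)
print(result)
```

The indexes in the path map directly onto the wording of the question, which is much of the appeal compared with counting tags in a pattern.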

See http://www.w3schools.com/xpath/ for how powerful XPath can be.

EDIT: See also: How do HTML parsers work if they don't use regexp?
