Parse HTML with CSS or XPath selectors?

My goal is to parse HTML with lxml, which supports both XPath and CSS selectors.

I can bind my model properties to either CSS or XPath expressions, but I'm not sure which would be the better choice in terms of, for example, less churn when the HTML layout changes, simpler expressions, or faster extraction.

What would you choose in such a situation?

1 answer

Which are you more comfortable with? Most people tend to find CSS selectors easier, and if others will be maintaining your work, you should take that into account. One reason may be that with CSS selectors you worry less about XML namespaces, which are the source of many errors. CSS selectors are also generally more compact than their XPath equivalents, but only you can decide whether that matters to you. I would point out that it is no coincidence that jQuery's selector language is modeled on CSS selectors, not XPath.

XPath, on the other hand, is a more expressive language for general DOM navigation. For example, there is no CSS selector equivalent to the parent or ancestor axes, and there is no way to address text nodes directly, as "text()" does in XPath. Conversely, I can't think of a single DOM path that can be expressed in CSS selectors but not in XPath, although E[foo~="warning"] and E[lang|="en"] are decidedly awkward to write in XPath.
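A short sketch of those points with lxml, using made-up markup: the parent axis, `text()`, and hand-written XPath counterparts of the `~=` and `|=` attribute selectors. The XPath translations shown are one common way to write them, not the only one.

```python
from lxml import html  # third-party: pip install lxml

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<div id="product">
  <p class="note warning" lang="en-GB">Low <b>stock</b> today</p>
  <p class="note" lang="fr">En stock</p>
</div>
""")

# Parent axis: step from a matched <b> up to the element containing it.
parent_tags = [el.tag for el in doc.xpath("//b/parent::*")]

# text(): select the text nodes that are direct children of the first <p>.
text_nodes = [str(t) for t in doc.xpath("//p[1]/text()")]

# CSS p[class~="warning"] (whitespace-separated word match) written in XPath.
word_match = doc.xpath(
    "//p[contains(concat(' ', normalize-space(@class), ' '), ' warning ')]/@lang"
)

# CSS p[lang|="en"] (exactly "en", or starting with "en-") written in XPath.
lang_match = doc.xpath("//p[@lang = 'en' or starts-with(@lang, 'en-')]/@lang")
```

The first two queries have no CSS counterpart; the last two do, and the CSS versions are considerably shorter.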

What CSS selectors do have that XPath doesn't are pseudo-classes, but if you are doing server-side DOM manipulation, they are unlikely to be useful to you.

As for what leads to faster extraction, I don't know lxml, but I would expect equivalent paths to have very similar performance characteristics.


Source: https://habr.com/ru/post/1306618/

