XPath for the first occurrence of an element with a text length >= 200 characters

How do I get the first element whose own text (plain text, ignoring child elements) is 200 or more characters long?

I am trying to build an HTML parser along the lines of Embed.ly, and I have set up a fallback system: I check og:description first, then I look for this element, and only then fall back to the description meta tag.

The reason is that most sites that include a meta description at all use it to describe the site as a whole, not the contents of the current page.
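The fallback order described above could be sketched roughly as follows, using only Python's standard library. The names `MetaCollector` and `pick_description` are hypothetical, and `first_long_text` stands in for whatever the body-text lookup returns (or `None` if nothing qualifies).

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> content values keyed by their property/name attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        if key and content:
            # Keep the first value seen for each key.
            self.meta.setdefault(key.lower(), content.strip())


def pick_description(html, first_long_text=None):
    """Fallback order from the question:
    og:description, then the long body text, then the meta description."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return (
        collector.meta.get("og:description")
        or first_long_text
        or collector.meta.get("description")
    )
```

Chaining with `or` means an empty or missing value simply falls through to the next candidate.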

Example:

 <html> <body> <div>some characters <p>200 characters <span>some more stuff</span></p> </div> </body> </html> 

Which selector would return the 200-character part of this HTML fragment? I don't want anything else, and I don't care what kind of element it is (except <script> or <style>), as long as it is the first one whose plain text contains at least 200 characters.

What does an XPath query look like?

+6
3 answers

Use:

 (//*[not(self::script or self::style)]/text()[string-length() >= 200])[1] 
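Note that the expression selects a text node rather than an element: the first text node, in document order, whose parent is not a script or style element. To make that concrete, here is a minimal sketch of the same walk in Python with `xml.etree.ElementTree`, assuming the input is well-formed XML. ElementTree exposes text nodes as `.text` and `.tail` instead of evaluating `text()`. The helper names are hypothetical, and the threshold follows the question's "200 or more".

```python
import xml.etree.ElementTree as ET

SKIPPED = {"script", "style"}


def text_nodes(elem):
    """Yield text nodes in document order, skipping those directly inside script/style."""
    wanted = elem.tag not in SKIPPED
    if wanted and elem.text:
        yield elem.text
    for child in elem:
        yield from text_nodes(child)
        # child.tail is the text that follows the child but belongs to elem.
        if wanted and child.tail:
            yield child.tail


def first_long_text(xml_source, min_length=200):
    """Return the first qualifying text node, or None if there is none."""
    root = ET.fromstring(xml_source)
    return next((t for t in text_nodes(root) if len(t) >= min_length), None)
```

Because each text node is measured on its own, text inside a nested `<span>` does not count toward the length of the surrounding `<p>`, which matches the "dropping other children" requirement in the question.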

Note: if the document is an XHTML document (meaning that all elements are in the XHTML namespace), the expression above should be written as:

 (//*[not(self::x:script or self::x:style)]/text()[string-length() >= 200])[1] 

where the prefix "x:" must be bound to the XHTML namespace, "http://www.w3.org/1999/xhtml" (or, as many XPath APIs put it, the namespace must be "registered" with this prefix).
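As a small illustration of what binding the prefix means in practice, here is a sketch using Python's `xml.etree.ElementTree`. Its XPath subset does not support `string-length()` or the `self::` axis, so this shows only the prefix binding, not the full expression. The sample document and variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping; the prefix "x" is arbitrary, the URI is not.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>hello</p></body></html>"
)

# Without a prefix the path looks for <p> in no namespace and finds nothing.
unprefixed = doc.findall(".//p")

# With the prefix bound to the XHTML namespace the element is found.
prefixed = doc.findall(".//x:p", XHTML_NS)
```

Other XPath APIs have their own way of supplying this mapping, but the idea is the same: the prefix in the expression is resolved through a mapping provided by the caller, not through the prefixes used in the document.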

+7

I meant something like this:

 root.SelectNodes("html/body/.//*[(name() !='script') and (name()!='style')]/text()[string-length() > 200]") 

It seems to work very well.

+2

HTML is not XML. You should not use XML parsers to parse HTML, period. They are two different things, and your parser will choke the first time it sees HTML that is not well-formed XML.

You should find an open source HTML parser instead of rolling your own.
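As one possible illustration of this advice, Python's standard-library `html.parser` accepts markup that is not well-formed XML (unclosed tags, missing end tags). The sketch below uses it in place of XPath to find the first text run of at least 200 characters outside script and style. The class and function names are hypothetical. It also strips surrounding whitespace before measuring, which is an assumption and differs from what `string-length()` would count. In edge cases the parser may deliver one run of text in several chunks, which this sketch does not rejoin.

```python
from html.parser import HTMLParser


class LongTextFinder(HTMLParser):
    """Remember the first text run of at least min_length characters."""

    def __init__(self, min_length=200):
        super().__init__()
        self.min_length = min_length
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.result is None and not self.skip_depth and len(text) >= self.min_length:
            self.result = text


def first_long_text_html(html, min_length=200):
    """Return the first qualifying text run, or None if there is none."""
    finder = LongTextFinder(min_length)
    finder.feed(html)
    finder.close()
    return finder.result
```

A dedicated HTML library would typically also build a tree that XPath can be run against; this sketch only shows that a tolerant parser gets through input an XML parser would reject.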

0

Source: https://habr.com/ru/post/910043/

