How to get full node content using xpath & lxml?

Question

I am using lxml's XPath support to extract parts of a web page. I am trying to get the contents of a tag that itself contains HTML tags. If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]

I get the right number of nodes, but they are returned as lxml objects (<Element font at 0x101fe5eb0>).

If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/text()

I get exactly the text I want, but without the HTML markup contained in the nodes.

If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/node()

I get a mixture of text and lxml elements (e.g. something something <Element a at 0x102ac2140> something).
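For reference, the three query forms can be compared side by side. This is a minimal sketch, assuming a small well-formed stand-in for the page (the markup below is hypothetical, and lxml.etree is used instead of an HTML parser to keep it simple):

```python
from lxml import etree

# Hypothetical stand-in for the real page; only the structure matters here.
markup = (
    '<td valign="top"><p>'
    '<font face="verdana" color="#ffffff" size="2">'
    'something something <a href="url">inside</a> something'
    '</font></p></td>'
)
root = etree.fromstring(markup)
base = ('//td[@valign="top"]/p[1]'
        '/font[@face="verdana" and @color="#ffffff" and @size="2"]')

elements = root.xpath(base)            # list of Element objects
texts = root.xpath(base + '/text()')   # text nodes directly under <font> only
nodes = root.xpath(base + '/node()')   # strings and Elements, in document order

print(elements)
print(texts)
print(nodes)
```

None of the three returns serialized markup on its own; the result is always a list of elements, strings, or both.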

Is it possible to use a pure XPath query to retrieve the contents of the nodes, or even to make lxml's .xpath() method return a string of the content rather than an lxml object?

Note that the XPath query returns a list of many nodes, so the solution needs to support that.

To clarify, I want to get something something <a href="url">inside</a> something from markup like this:

<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>

+3

python html xpath lxml

significance, 06 Nov '10 at 19:19

2

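"Is it possible to use a pure XPath query to retrieve the contents of the nodes, or even to make lxml's .xpath() method return a string of the content rather than an lxml object? [...] I want to get something something <a href="url">inside</a> something [...]"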

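Short answer: no.

XPath does not work on "tags"; it works on nodes.

The selected nodes are represented as objects of the language that hosts XPath.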

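If you need the string representation of a node's markup, the objects that represent nodes usually offer an outerXML property or an equivalent; check the documentation of the hosting library (lxml in this case).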

Update: lxml's tostring() function is the equivalent of the outerXML property found in other environments.

+2

Dimitre Novatchev, 06 Nov '10 at 19:28

unutbu · Accepted Answer · 2010-11-06T19:56:26+0000

Perhaps something like this?

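import io
import lxml.etree as le

content = '''\
<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>
'''
# Parse from an in-memory file (cStringIO no longer exists in Python 3).
doc = le.parse(io.BytesIO(content.encode("utf-8")))

# Select every element child of the matching <font> nodes.
xpath = '//font[@face="verdana" and @color="#ffffff" and @size="2"]/child::*'
x = doc.xpath(xpath)

# encoding="unicode" makes tostring() return str instead of bytes;
# the text after </a> is included because it is the element's tail.
print([le.tostring(el, encoding="unicode") for el in x])
# ['<a href="url">inside</a> something']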
