Extract specific element from nested elements using lxml html

Question

Hi everyone, I have a problem that seems to come down to XPath. I am using the html module from the lxml package to extract some data. Below is the most simplified version of the situation; keep in mind that the HTML I actually work with is much uglier.

    <table>
      <tr>
        <td>
          <table>
            <tr><td></td></tr>
            <tr><td>
              <table>
                <tr><td><u><b>Header1</b></u></td></tr>
                <tr><td>Data</td></tr>
              </table>
            </td></tr>
          </table>
        </td></tr>
    </table>

What I really want is the most deeply nested table, because it contains the header text "Header1". This is what I tried:

    from lxml import html

    page = '...'
    tree = html.fromstring(page)
    print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))

but it gives me all the table elements. I want only the one table that directly contains this text. I understand what is happening, but I can't figure out how to do this short of resorting to some ugly regex. Any suggestions?

+4

python html parsing xpath lxml

Dan.StackOverflow Apr 14 '10 at 4:40

4 answers

Find the header you are interested in, and then select its enclosing table:

  //u[b = 'Header1']/ancestor::table[1]

or

  //td[not(.//table) and .//b = 'Header1']/ancestor::table[1]
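To try the two expressions above against the markup from the question, something like the following should do. It is only a minimal sketch: the `PAGE` constant and the variable names are mine, and the sample is written out on several lines.

```python
from lxml import html

# The sample markup from the question, written out on multiple lines.
PAGE = """
<table>
  <tr><td>
    <table>
      <tr><td></td></tr>
      <tr><td>
        <table>
          <tr><td><u><b>Header1</b></u></td></tr>
          <tr><td>Data</td></tr>
        </table>
      </td></tr>
    </table>
  </td></tr>
</table>
"""

tree = html.fromstring(PAGE)

# Variant 1: start from the <u> whose <b> child holds the header text.
by_u = tree.xpath("//u[b = 'Header1']/ancestor::table[1]")

# Variant 2: start from a <td> that contains the header but no nested table.
by_td = tree.xpath("//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]")

for result in (by_u, by_td):
    # Each variant should yield one table with no further tables inside it.
    print(len(result), len(result[0].xpath(".//table")))
```

Both variants walk up from the header with `ancestor::table[1]`, so they stop at the nearest enclosing table rather than at the outer ones.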

Note that // always starts at the document root (!). You cannot write:

  //table[//*[contains(text(), "Header1")]]

and expect the inner predicate ( //*… ) to magically start in the right context. Use .// to start from the context node. Even then, this:

  //table[.//*[contains(text(), "Header1")]]

will not work, since even the outermost table contains the text 'Header1' somewhere deep down, so the predicate evaluates to true for every table in your example. Use not(), as I did, to make sure that no other tables are nested inside the one you select.

Also, avoid testing the condition on every node with .//*, since it can only be true for a few of them. It is more efficient to be specific.
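The difference between //, .// and the not() guard is easier to see with a table that does not contain the header at all. The sketch below assumes a slightly extended sample: the question's markup wrapped in a `<div>` with one unrelated table added. That extra table and all variable names are my own additions for illustration.

```python
from lxml import html

# Assumption: the question's sample wrapped in a <div>, plus one unrelated
# table without the header, so the three expressions give different counts.
PAGE = """
<div>
  <table>
    <tr><td>
      <table>
        <tr><td></td></tr>
        <tr><td>
          <table>
            <tr><td><u><b>Header1</b></u></td></tr>
            <tr><td>Data</td></tr>
          </table>
        </td></tr>
      </table>
    </td></tr>
  </table>
  <table><tr><td>Other</td></tr></table>
</div>
"""

tree = html.fromstring(PAGE)

# "//" inside the predicate restarts at the document root, so the test has
# the same outcome for every table, including the unrelated one.
absolute = tree.xpath('//table[//*[contains(text(), "Header1")]]')

# ".//" searches below each candidate table, but every enclosing table
# still has the header somewhere among its descendants.
relative = tree.xpath('//table[.//*[contains(text(), "Header1")]]')

# Excluding tables that contain another table leaves only the innermost one.
innermost = tree.xpath(
    '//table[not(.//table) and .//*[contains(text(), "Header1")]]'
)

print(len(absolute), len(relative), len(innermost))
```

With this markup the counts should come out as 4, 3 and 1: the absolute predicate matches every table, the relative one matches the three nested tables, and the not() guard narrows that down to the single innermost table.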

+2

Tomalak Apr 14 '10 at 8:47

Maybe this will work for you:

 tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")

The not(descendant::table) bit ensures that you get the innermost table.
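One thing to check with this expression is what it actually returns: the final step `/*[...]` selects a child of the innermost table, not the table itself. The sketch below compares it with a variant that moves the text test into the table's predicate. The variant and the variable names are mine, and it assumes the parser does not insert a `<tbody>` (as far as I know, lxml's HTML parser does not).

```python
from lxml import html

# The sample markup from the question, written out on multiple lines.
PAGE = """
<table>
  <tr><td>
    <table>
      <tr><td></td></tr>
      <tr><td>
        <table>
          <tr><td><u><b>Header1</b></u></td></tr>
          <tr><td>Data</td></tr>
        </table>
      </td></tr>
    </table>
  </td></tr>
</table>
"""

tree = html.fromstring(PAGE)

# As written above: the last step picks a child of the innermost table.
children = tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")

# Variant (my assumption of the intent): test the text on the table itself.
tables = tree.xpath("//table[not(descendant::table) and contains(., 'Header1')]")

print([el.tag for el in children])
print([el.tag for el in tables])
```

If the table element itself is what you need, the second form avoids an extra `getparent()` call on the row.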

0

Michał Marczyk Apr 14 '10 at 5:48

 table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')

//*[.="Header1"] selects any element anywhere in the document whose string value is exactly Header1.
ancestor::table[1] selects the nearest table ancestor of that element.
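Comparing `.` and comparing `text()` are not the same test, which matters here because the header text is wrapped in `<u><b>`. A small sketch of the difference on the question's sample (the variable names are mine):

```python
from lxml import html

# The sample markup from the question, written out on multiple lines.
PAGE = """
<table>
  <tr><td>
    <table>
      <tr><td></td></tr>
      <tr><td>
        <table>
          <tr><td><u><b>Header1</b></u></td></tr>
          <tr><td>Data</td></tr>
        </table>
      </td></tr>
    </table>
  </td></tr>
</table>
"""

tree = html.fromstring(PAGE)

# text() looks only at an element's own text-node children.
by_text = tree.xpath('//*[text()="Header1"]')

# "." is the string value: all descendant text concatenated, so wrappers
# around the <b> match as well when they contain nothing else.
by_value = tree.xpath('//*[.="Header1"]')

# Several matching elements still share one nearest table, and XPath
# returns a node-set, so the table appears only once.
tables = tree.xpath('//*[.="Header1"]/ancestor::table[1]')

print([el.tag for el in by_text])
print([el.tag for el in by_value])
print(len(tables))
```

This is why unpacking with `table, = ...` works even though more than one element has the string value Header1.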

Full example

    #!/usr/bin/env python
    from lxml import html

    page = """
    <table>
      <tr>
        <td>
          <table>
            <tr><td></td></tr>
            <tr><td>
              <table>
                <tr><td><u><b>Header1</b></u></td></tr>
                <tr><td>Data</td></tr>
              </table>
            </td></tr>
          </table>
        </td></tr>
    </table>
    """

    tree = html.fromstring(page)
    # Unpacking with "table," fails unless exactly one table matched.
    table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
    print(html.tostring(table))

0

jfs Apr 14 '10 at 6:05

Dimitre Novatchev · Accepted Answer · 2010-04-14T13:04:24+0000

Use:

 //td[. = 'Header1']/ancestor::table[1]