Extract specific element from nested elements using lxml html

Hi everyone, I have a problem that seems to come down to XPath. I am using the html module from the lxml package to extract some data. Below is the most simplified version of the situation; keep in mind that the HTML I actually work with is much uglier.

    <table>
      <tr>
        <td>
          <table>
            <tr><td></td></tr>
            <tr><td>
              <table>
                <tr><td><u><b>Header1</b></u></td></tr>
                <tr><td>Data</td></tr>
              </table>
            </td></tr>
          </table>
        </td></tr>
    </table>

What I want is the most deeply nested table, the one that contains the header text "Header1". This is what I tried:

    from lxml import html

    page = '...'
    tree = html.fromstring(page)
    print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))

but it returns every table element. I only want the one table that directly contains this text. I understand what is happening, but I cannot figure out how to do this short of resorting to some nasty regex. Any suggestions?

+4
4 answers

Use:

 //td[text() = 'Header1']/ancestor::table[1] 
+3

Find the header you are interested in, then select its table.

  //u[b = 'Header1']/ancestor::table[1]

or

  //td[not(.//table) and .//b = 'Header1']/ancestor::table[1]

Note that a path beginning with // always starts at the document root (!). You cannot write:

  //table[//*[contains(text(), "Header1")]]

and expect the inner predicate ( //*… ) to magically start in the right context. Use .// to start from the context node. Even then, this:

  //table[.//*[contains(text(), "Header1")]]

will not work, because even the outermost table contains the text 'Header1' somewhere deep inside, so the predicate evaluates to true for every table in your example. Use not(), as I did, to make sure no other tables are nested inside the match.

Also, avoid testing the condition against every node with .//*, since it cannot be true for most of them. It is more efficient to be specific.
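
The three predicates discussed above can be compared side by side. The following is only a minimal sketch: the PAGE string is a stand-in that copies the nesting from the question, and the variable names are made up for illustration.

```python
from lxml import html

# Stand-in markup with the same nesting as in the question:
# three tables, one inside another.
PAGE = """
<table><tr><td>
  <table>
    <tr><td></td></tr>
    <tr><td>
      <table>
        <tr><td><u><b>Header1</b></u></td></tr>
        <tr><td>Data</td></tr>
      </table>
    </td></tr>
  </table>
</td></tr></table>
"""

tree = html.fromstring(PAGE)

# Absolute path inside the predicate: evaluated from the document root,
# so the result is the same for every table.
absolute = tree.xpath('//table[//*[contains(text(), "Header1")]]')

# Relative path: evaluated per table, but every enclosing table also has
# the header somewhere among its descendants.
relative = tree.xpath('//table[.//*[contains(text(), "Header1")]]')

# Relative path plus not(.//table): only a table without nested tables matches.
innermost = tree.xpath(
    '//table[not(.//table) and .//*[contains(text(), "Header1")]]'
)

print(len(absolute), len(relative), len(innermost))
```

With this markup the first two expressions should match all three tables, while the last one matches only the innermost one.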

+2

Maybe this will work for you:

 tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]") 

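The not(descendant::table) bit restricts the match to the innermost table, that is, a table with no tables nested inside it. Note that the final /* step makes the expression return the matching child of that table rather than the table itself.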
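
To see what this expression actually returns, it can be run against markup shaped like the one in the question. This is a rough sketch: PAGE is a stand-in copy of the question's nesting, and the second expression is only an assumed variant (not part of the answer) that moves the text test into a predicate so that the table itself is selected.

```python
from lxml import html

# Stand-in markup with the same nesting as in the question.
PAGE = (
    "<table><tr><td>"
    "<table><tr><td></td></tr><tr><td>"
    "<table>"
    "<tr><td><u><b>Header1</b></u></td></tr>"
    "<tr><td>Data</td></tr>"
    "</table>"
    "</td></tr></table>"
    "</td></tr></table>"
)
tree = html.fromstring(PAGE)

# Expression from the answer: the last step selects children of the table.
children = tree.xpath(
    "//table[not(descendant::table)]/*[contains(., 'Header1')]"
)

# Assumed variant: keep the text test as a second predicate on the table,
# so the table element itself is returned.
tables = tree.xpath(
    "//table[not(descendant::table)][contains(., 'Header1')]"
)

print([el.tag for el in children])
print([el.tag for el in tables])
```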

0
 table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]') 
  • //*[.="Header1"] selects every element in the document whose string value (the concatenated text of the element and its descendants) is Header1.
  • ancestor::table[1] selects the nearest ancestor table element of each match.
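
The difference between comparing text() and comparing . (the string value) matters for this markup, because the header text sits inside u and b rather than directly inside the td. A small sketch, using a cut-down stand-in for the innermost table from the question:

```python
from lxml import html

# Cut-down stand-in for the innermost table from the question.
tree = html.fromstring(
    "<table>"
    "<tr><td><u><b>Header1</b></u></td></tr>"
    "<tr><td>Data</td></tr>"
    "</table>"
)

# text() = ... compares only the direct text-node children of an element.
by_text = tree.xpath('//*[text()="Header1"]')

# . = ... compares the string value: all descendant text concatenated.
by_value = tree.xpath('//*[.="Header1"]')

# The td has no direct text child, so a text() test on it finds nothing.
td_by_text = tree.xpath('//td[text()="Header1"]')

print([el.tag for el in by_text])
print([el.tag for el in by_value])
print(td_by_text)
```

Several elements can share the string value Header1 here, but they all have the same nearest ancestor table, so the ancestor::table[1] step still yields a single table.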

Full example

    #!/usr/bin/env python
    from lxml import html

    page = """
    <table>
      <tr>
        <td>
          <table>
            <tr><td></td></tr>
            <tr><td>
              <table>
                <tr><td><u><b>Header1</b></u></td></tr>
                <tr><td>Data</td></tr>
              </table>
            </td></tr>
          </table>
        </td></tr>
    </table>
    """

    tree = html.fromstring(page)
    # Tuple unpacking raises ValueError unless exactly one table matches.
    table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
    # encoding="unicode" returns str instead of bytes on Python 3.
    print(html.tostring(table, encoding="unicode"))
0

Source: https://habr.com/ru/post/1306880/

