
How to get html elements with python lxml

I have this HTML code:

    <table>
      <tr>
        <td class="test"><b><a href="">aaa</a></b></td>
        <td class="test">bbb</td>
        <td class="test">ccc</td>
        <td class="test"><small>ddd</small></td>
      </tr>
      <tr>
        <td class="test"><b><a href="">eee</a></b></td>
        <td class="test">fff</td>
        <td class="test">ggg</td>
        <td class="test"><small>hhh</small></td>
      </tr>
    </table>

I use this Python code to retrieve the <td class="test"> elements with the lxml module:

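    import urllib2
    import lxml.html

    # Download the page and parse it into an element tree
    code = urllib2.urlopen("http://www.example.com/page.html").read()
    html = lxml.html.fromstring(code)

    # Select the 1st and 4th <td class="test"> of each row
    result = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')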

It works well! Result:

    <td class="test"><b><a href="">aaa</a></b></td>
    <td class="test"><small>ddd</small></td>
    <td class="test"><b><a href="">eee</a></b></td>
    <td class="test"><small>hhh</small></td>

(That is, the first and fourth columns of each <tr>.) Now I need to extract:

aaa (link text)

ddd (text inside the <small>)

eee (link text)

hhh (text inside the <small>)

How can I extract these values?

(The problem is that I need to strip the <b> and get the link text in the first column, and strip the <small> in the fourth column.)

Thanks!

2 answers

If you call el.text_content() on each element, it drops all the nested tags and returns only the text inside the element:

 result = [el.text_content() for el in result] 
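
For anyone who wants to try this without fetching a page, here is a minimal sketch that runs the same XPath and text_content() call against the table from the question. It assumes Python 3 and that lxml is installed; the PAGE constant and the first_and_fourth_text helper are names made up for this illustration, not part of the original post.

```python
import lxml.html

# The table from the question, inlined so no network access is needed.
# PAGE is a hypothetical name used only for this sketch.
PAGE = """
<table>
  <tr>
    <td class="test"><b><a href="">aaa</a></b></td>
    <td class="test">bbb</td>
    <td class="test">ccc</td>
    <td class="test"><small>ddd</small></td>
  </tr>
  <tr>
    <td class="test"><b><a href="">eee</a></b></td>
    <td class="test">fff</td>
    <td class="test">ggg</td>
    <td class="test"><small>hhh</small></td>
  </tr>
</table>
"""


def first_and_fourth_text(markup):
    """Return the text of the 1st and 4th <td class="test"> of every row."""
    doc = lxml.html.fromstring(markup)
    cells = doc.xpath('//td[@class="test"][position() = 1 or position() = 4]')
    # text_content() concatenates all text under the element, ignoring
    # the <b>, <a> and <small> wrappers.
    return [cell.text_content().strip() for cell in cells]


values = first_and_fourth_text(PAGE)
print(values)  # ['aaa', 'ddd', 'eee', 'hhh']
```

The strip() call is only a precaution for real pages, where cells often carry surrounding whitespace.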

Why not select exactly what you want in each query?

    links = [el.text for el in html.xpath('//td[@class="test"][position() = 1]/b/a')]
    smalls = [el.text for el in html.xpath('//td[@class="test"][position() = 4]/small')]
    print zip(links, smalls)
    # => [('aaa', 'ddd'), ('eee', 'hhh')]
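
One caveat with zipping two independent lists: if some row lacks the link or the <small>, the lists can fall out of step. A possible variation, sketched here under the assumption of Python 3 with lxml installed, is to walk the rows and query each one with a relative XPath. The names PAGE and link_and_small_per_row are hypothetical and exist only for this example.

```python
import lxml.html

# Same table as in the question; PAGE is a hypothetical name for this sketch.
PAGE = """
<table>
  <tr>
    <td class="test"><b><a href="">aaa</a></b></td>
    <td class="test">bbb</td>
    <td class="test">ccc</td>
    <td class="test"><small>ddd</small></td>
  </tr>
  <tr>
    <td class="test"><b><a href="">eee</a></b></td>
    <td class="test">fff</td>
    <td class="test">ggg</td>
    <td class="test"><small>hhh</small></td>
  </tr>
</table>
"""


def link_and_small_per_row(markup):
    """Return one (link text, small text) pair per complete row."""
    doc = lxml.html.fromstring(markup)
    pairs = []
    for row in doc.xpath('//tr'):
        # Relative paths (no leading //) are evaluated against this row only.
        links = row.xpath('td[@class="test"][1]/b/a/text()')
        smalls = row.xpath('td[@class="test"][4]/small/text()')
        # Skip rows where either piece is missing instead of misaligning.
        if links and smalls:
            pairs.append((str(links[0]), str(smalls[0])))
    return pairs


pairs = link_and_small_per_row(PAGE)
print(pairs)  # [('aaa', 'ddd'), ('eee', 'hhh')]
```

For the table in the question both approaches give the same pairs; the per-row loop only matters when the markup is irregular.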

Source: https://habr.com/ru/post/1309373/

