How to get HTML elements with Python lxml
I have this HTML code:
    <table>
      <tr>
        <td class="test"><b><a href="">aaa</a></b></td>
        <td class="test">bbb</td>
        <td class="test">ccc</td>
        <td class="test"><small>ddd</small></td>
      </tr>
      <tr>
        <td class="test"><b><a href="">eee</a></b></td>
        <td class="test">fff</td>
        <td class="test">ggg</td>
        <td class="test"><small>hhh</small></td>
      </tr>
    </table>

I use this Python code to retrieve the entire <td class="test"> using the lxml module.
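    import urllib2
    import lxml.html

    # Fetch the page and parse it into an lxml element tree
    code = urllib2.urlopen("http://www.example.com/page.html").read()
    html = lxml.html.fromstring(code)
    # Keep only the 1st and 4th <td class="test"> of each row
    result = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')

It works well! Result: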
    <td class="test"><b><a href="">aaa</a></b></td>
    <td class="test"><small>ddd</small></td>
    <td class="test"><b><a href="">eee</a></b></td>
    <td class="test"><small>hhh</small></td>

(so the first and fourth columns of each <tr>)

Now I have to extract:
aaa (link name)
ddd (text between the <small> tags)
eee (link name)
hhh (text between the <small> tags)
How can I extract these values?
(The problem is that I need to strip the <b> and get the anchor text in the first column, and strip the <small> in the fourth column.)
Thanks!
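For illustration, here is a minimal sketch of one possible way to get those values. It assumes that only the text inside each selected cell is wanted, and it uses lxml's text_content() method, which returns an element's text with all nested tags (<b>, <a>, <small>) dropped. The SAMPLE string and the extract_cells helper are hypothetical names introduced for this example; the markup is copied inline from the question so the sketch runs without fetching a page.

```python
import lxml.html

# Inline copy of the table from the question (assumption: the real page
# has the same structure), so no network request is needed.
SAMPLE = """<table>
  <tr>
    <td class="test"><b><a href="">aaa</a></b></td>
    <td class="test">bbb</td>
    <td class="test">ccc</td>
    <td class="test"><small>ddd</small></td>
  </tr>
  <tr>
    <td class="test"><b><a href="">eee</a></b></td>
    <td class="test">fff</td>
    <td class="test">ggg</td>
    <td class="test"><small>hhh</small></td>
  </tr>
</table>"""


def extract_cells(markup):
    """Return the plain text of the 1st and 4th <td class="test"> of each row."""
    doc = lxml.html.fromstring(markup)
    # Same XPath as in the question: first and fourth matching cell per row.
    cells = doc.xpath('//td[@class="test"][position() = 1 or position() = 4]')
    # text_content() flattens nested elements down to their text.
    return [cell.text_content().strip() for cell in cells]


values = extract_cells(SAMPLE)
print(values)
```

With the question's own code, the same list comprehension could be applied to result directly. If the link text and the <small> text need to be told apart, separate XPath queries per column would probably be clearer than one combined query.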
2 answers