How to get HTML elements with Python lxml

Question

I have this HTML code:

 <table>
   <tr>
     <td class="test"><b><a href="">aaa</a></b></td>
     <td class="test">bbb</td>
     <td class="test">ccc</td>
     <td class="test"><small>ddd</small></td>
   </tr>
   <tr>
     <td class="test"><b><a href="">eee</a></b></td>
     <td class="test">fff</td>
     <td class="test">ggg</td>
     <td class="test"><small>hhh</small></td>
   </tr>
 </table>

I use this Python code with the lxml module to retrieve the <td class="test"> elements I need:

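 import urllib2
 import lxml.html
 # Download the page and parse it into an element tree
 code = urllib2.urlopen("http://www.example.com/page.html").read()
 html = lxml.html.fromstring(code)
 # Select the first and fourth matching <td> within each row
 result = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')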

It works well! Result:

 <td class="test"><b><a href="">aaa</a></b></td>
 <td class="test"><small>ddd</small></td>
 <td class="test"><b><a href="">eee</a></b></td>
 <td class="test"><small>hhh</small></td>

(that is, the first and fourth columns of each <tr>). Now I need to extract:

aaa (link text)
ddd (text inside the <small>)
eee (link text)
hhh (text inside the <small>)

How can I extract these values?

(The problem is that I need to strip the <b> and get the link text in the first column, and strip the <small> in the fourth column.)

Thanks!

python xml lxml

Damiano May 10, '10 at 23:50

2 answers

Ian Bicking · Answer 1 · 2010-05-11T02:13:07+0000

If you call el.text_content() on each element, it strips all the nested tags and returns only the text:

 result = [el.text_content() for el in result]
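To make the effect of text_content() concrete, here is a minimal sketch in Python 3 that can be run offline. It assumes the table from the question is inlined as a string (the name PAGE is made up for this example) instead of being downloaded, and it reuses the XPath expression from the question unchanged.

```python
import lxml.html

# The sample table from the question, inlined so the sketch needs no network.
# PAGE is a hypothetical name used only for this illustration.
PAGE = """
<table>
  <tr>
    <td class="test"><b><a href="">aaa</a></b></td>
    <td class="test">bbb</td>
    <td class="test">ccc</td>
    <td class="test"><small>ddd</small></td>
  </tr>
  <tr>
    <td class="test"><b><a href="">eee</a></b></td>
    <td class="test">fff</td>
    <td class="test">ggg</td>
    <td class="test"><small>hhh</small></td>
  </tr>
</table>
"""

html = lxml.html.fromstring(PAGE)

# Same selection as in the question: first and fourth cell of each row.
cells = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')

# text_content() concatenates the text of the element and all its descendants,
# so the <b>, <a> and <small> wrappers disappear.
texts = [el.text_content() for el in cells]
print(texts)  # ['aaa', 'ddd', 'eee', 'hhh']
```

Because text_content() ignores which tags wrap the text, the same call works for both the link cells and the <small> cells.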

Jochen Ritzel · Answer 2 · 2010-05-11T01:20:10+0000

Why not select exactly what you want at each step?

 links = [el.text for el in html.xpath('//td[@class="test"][position() = 1]/b/a')]
 smalls = [el.text for el in html.xpath('//td[@class="test"][position() = 4]/small')]
 print zip(links, smalls)
 # => [('aaa', 'ddd'), ('eee', 'hhh')]
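The snippet above is written for Python 2. A possible Python 3 port is sketched here, again assuming the question's table is inlined as a string (PAGE is a made-up name) so it runs without a network connection. In Python 3, zip() returns a lazy iterator, so it is wrapped in list() before printing.

```python
import lxml.html

# The sample table from the question, inlined for an offline run.
# PAGE is a hypothetical name used only for this illustration.
PAGE = """
<table>
  <tr>
    <td class="test"><b><a href="">aaa</a></b></td>
    <td class="test">bbb</td>
    <td class="test">ccc</td>
    <td class="test"><small>ddd</small></td>
  </tr>
  <tr>
    <td class="test"><b><a href="">eee</a></b></td>
    <td class="test">fff</td>
    <td class="test">ggg</td>
    <td class="test"><small>hhh</small></td>
  </tr>
</table>
"""

html = lxml.html.fromstring(PAGE)

# Walk straight down to the <a> in the first cell and the <small> in the fourth.
links = [el.text for el in html.xpath('//td[@class="test"][position() = 1]/b/a')]
smalls = [el.text for el in html.xpath('//td[@class="test"][position() = 4]/small')]

# Pair the two lists up row by row; list() materialises the lazy zip object.
pairs = list(zip(links, smalls))
print(pairs)  # [('aaa', 'ddd'), ('eee', 'hhh')]
```

Note that pairing two independently collected lists relies on every row containing both a link and a <small>; if some rows might lack one of them, looping over the rows and querying each one separately would keep the pairs aligned.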
