I believe that your program will face many problems, as input data is entered - what if the βnameβ changes, or is there a typo?
It is not possible to make a strict decision to scrap another site, as they cannot completely change everything. It is usually better to write a tolerant and flexible code that at least tries to verify that its output is normal. In this case, it is best to iterate over the results of "// table / tr", then process td elements inside this loop:
import lxml.etree tree = lxml.etree.fromstring("<table><tr><td>test</td></tr><tr><td><div>test2</div></td></tr></table>") stringify = lambda x : "".join(x.xpath(".//text()")) for x in tree.xpath("//table/tr"): print "New row" for y in x.xpath("td"): print stringify(y)
Output:
New row test New row test2
The following code, however, will get the list you request:
print map(stringify, tree.xpath("//table/tr/td"))
Output:
['test', 'test2']
Here all text elements that generally come from td, which is a direct descendant of tr, which, in turn, is a direct descendant of the table, will be found.
(A simple query for all text () elements will create some funny errors when launched in HTML that contains "<td> Foo <b> bar </b> </td>" or similar.)
bucko source share