Scrapy xpath: select ancestor node
I have a question about xpath
<div id="A">
  <div class="B">
    <div class="C">
      <div class="item">
        <div class="area">
          <div class="sec">USA</div>
          <table>
            <tbody>
              <tr> <td><a href="">D1</a></td> <td>D2</td> </tr>
              <tr class="even"> <td><a href="">E1</a></td> <td>E2</td> </tr>
            </tbody>
          </table>
        </div>
        <div class="area">
          <div class="sec">UK</div>
          <table>
            <tbody>
              <tr> <td><a href="">F1</a></td> <td>F2</td> </tr>
            </tbody>
          </table>
        </div>
      </div>
    </div>
  </div>
</div>

My code is:
from scrapy.selector import Selector

sel = Selector(response)
group = sel.xpath("//div[@id='A']/div[@class='B']/div[@class='C']/div[@class='item']/div[@class='area']/table/tbody/tr")
for g in group:
    # section = g.xpath("").extract()  # ancestor???
    context = g.xpath("./td[1]/a/text()").extract()
    brief = g.xpath("./td[2]/text()").extract()
    # print(section[0])
    print(context[0])
    print(brief[0])

It prints:
D1
D2
E1
E2
F1
F2

But I want to print:
USA
D1
D2
USA
E1
E2
UK
F1
F2

So for each row I need to select the value from its enclosing section, so that I can get USA and UK.
I haven't been able to figure this out for a while.
Any help would be appreciated, thank you!
In XPath, you can move back up the tree with .. (the parent step), so a selector like this might work for you:
section = g.xpath('../../../div[@class="sec"]/text()').extract()

Although this will work, it depends heavily on the exact structure of your document: it assumes the row sits exactly three levels below the div that holds the section name. If you need a little more flexibility to tolerate minor structural changes to the document, you can look back for an ancestor like this:
section = g.xpath('ancestor::div[@class="area"]/div[@class="sec"]/text()').extract()

http://www.tizag.com/xmlTutorial/xpathparent.php is a good link.
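To try the parent-step and ancestor-axis selectors without running a crawl, here is a minimal sketch using lxml directly (Scrapy's selectors are built on lxml, so the XPath expressions behave the same way). The markup is a trimmed copy of the snippet from the question, and the names HTML, rows, by_parent and by_ancestor are only illustrative. The [1] after the ancestor step is an addition that picks the nearest enclosing area in case such divs are ever nested.

```python
from lxml import etree

# Trimmed copy of the question's markup (assumption: the real page nests the same way).
HTML = """
<div id="A"><div class="B"><div class="C"><div class="item">
  <div class="area">
    <div class="sec">USA</div>
    <table><tbody>
      <tr><td><a href="">D1</a></td><td>D2</td></tr>
      <tr class="even"><td><a href="">E1</a></td><td>E2</td></tr>
    </tbody></table>
  </div>
  <div class="area">
    <div class="sec">UK</div>
    <table><tbody>
      <tr><td><a href="">F1</a></td><td>F2</td></tr>
    </tbody></table>
  </div>
</div></div></div></div>
"""

tree = etree.fromstring(HTML.strip())

rows = []
for tr in tree.xpath("//div[@class='area']/table/tbody/tr"):
    # Fixed-depth walk: tr -> tbody -> table -> div.area, then down to div.sec
    by_parent = tr.xpath("../../../div[@class='sec']/text()")
    # Ancestor axis: nearest enclosing div.area, regardless of how deep the row is
    by_ancestor = tr.xpath("ancestor::div[@class='area'][1]/div[@class='sec']/text()")
    context = tr.xpath("./td[1]/a/text()")
    brief = tr.xpath("./td[2]/text()")
    rows.append((by_parent[0], by_ancestor[0], context[0], brief[0]))

for section, _, context, brief in rows:
    print(section, context, brief)
```

Both lookups give the same section name for every row here. They would diverge only if the nesting between the row and its area changed, in which case the fixed-depth version is the one that breaks.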
Getting the parent can be done with an XPath of the form child/..
from lxml import etree

a = '<div id="A" ><div class="B"><div class="C"><div class="item"><div class="area"><div class="sec">USA</div> <table> <tbody> <tr> <td><a href="">D1</a></td> <td>D2</td> </tr> <tr class="even"> <td><a href="">E1</a></td> <td>E2</td> </tr> </tbody> </table> </div> <div class="area"> <div class="sec">UK</div> <table> <tbody> <tr> <td><a href="">F1</a></td> <td>F2</td> </tr> </tbody> </table> </div> </div> </div> </div> </div>'
tree = etree.fromstring(a)
# Keep only the text nodes that are not pure whitespace
print([t for t in tree.xpath('//div[@class="area"]//text()') if t.strip()])

Output:
['USA', 'D1', 'D2', 'E1', 'E2', 'UK', 'F1', 'F2']
// - extracts all descendants
/ - extracts only direct children
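The difference between the two separators is easy to see on a small tree. This is a rough sketch using the standard library's xml.etree.ElementTree, which supports this subset of XPath. The markup is a cut-down, hypothetical version of the question's document, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the question's markup (hypothetical, for illustration only).
root = ET.fromstring(
    '<div id="A"><div class="B"><div class="C">'
    '<div class="sec">USA</div><div class="sec">UK</div>'
    '</div></div></div>'
)

# "./div" looks only at the direct children of the root element
direct_children = root.findall("./div")
# ".//div" looks at every div below the root, at any depth
all_descendants = root.findall(".//div")

direct_classes = [d.get("class") for d in direct_children]
descendant_classes = [d.get("class") for d in all_descendants]

print(direct_classes)
print(descendant_classes)
```

Only the B div is a direct child of the root, while the descendant search also picks up C and both sec divs.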