Why does xpath inside the selector loop still return a list in the tutorial?

Question

I am learning Scrapy with the tutorial: http://doc.scrapy.org/en/1.0/intro/tutorial.html

When I ran the following example script from the tutorial, I found that although it was already iterating over the list of selectors, the title obtained from sel.xpath('a/text()').extract() was still a list containing one string, like [u'Python 3 Object Oriented Programming'], not u'Python 3 Object Oriented Programming'. In a later example, the list is assigned to the item as item['title'] = sel.xpath('a/text()').extract(), which I think is logically incorrect.

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                title = sel.xpath('a/text()').extract()
                link = sel.xpath('a/@href').extract()
                desc = sel.xpath('text()').extract()
                print title, link, desc

However, if I use the following code:

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/",
        ]

        def parse(self, response):
            for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
                link = href.extract()
                print(link)

then link is a string, not a list.

Is this a bug or intended behavior?

xpath scrapy

entron Feb 26 '16 at 10:37

1 answer

paul trmbrth · Accepted Answer · 2016-02-26T10:44:05+0000

.xpath().extract() and .css().extract() return a list because .xpath() and .css() return SelectorList objects.

See https://parsel.readthedocs.org/en/v1.0.1/usage.html#parsel.selector.SelectorList.extract

(SelectorList).extract():
Call the .extract() method for each element in this list and return their results flattened, as a list of Unicode strings.

.extract_first() is what you are looking for (which is poorly documented)

Taken from http://doc.scrapy.org/en/latest/topics/selectors.html :

If you want to extract only the first matched element, you can call .extract_first() on the selector:

    >>> response.xpath('//div[@id="images"]/a/text()').extract_first()
    u'Name: My image 1 '
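The difference between the two calls can be sketched with a simplified, pure-Python stand-in. This is an illustrative model only, not the actual parsel or Scrapy implementation: the class names mirror the library's, but the `children` lookup table and the sample data are assumptions made up for the example. It is written for Python 3, so strings appear without the `u` prefix.

```python
# Simplified stand-ins for the real classes; the names mirror parsel/Scrapy,
# but this is an illustrative model, not the library's implementation.

class Selector:
    """Wraps one matched node (here just a string) plus fake child matches."""

    def __init__(self, data, children=None):
        self.data = data
        # Hypothetical lookup table: query string -> list of matched values.
        self.children = children or {}

    def xpath(self, query):
        # A query can match zero, one or many nodes, so the result is
        # always a SelectorList, even when exactly one node matches.
        matches = self.children.get(query, [])
        return SelectorList(Selector(value) for value in matches)

    def extract(self):
        # One selector holds one node, so it yields one string.
        return self.data


class SelectorList(list):
    def extract(self):
        # Call extract() on every element and collect the strings in a list.
        return [sel.extract() for sel in self]

    def extract_first(self, default=None):
        # Return the first extracted string, or the default if nothing matched.
        for sel in self:
            return sel.extract()
        return default


li = Selector("<li>...</li>", {"a/text()": ["Python 3 Object Oriented Programming"]})

as_list = li.xpath("a/text()").extract()          # list holding one string
as_string = li.xpath("a/text()").extract_first()  # the string itself
missing = li.xpath("a/@title").extract_first()    # None, nothing matched

# Iterating over a SelectorList yields Selector objects, so extract() on
# the loop variable returns a plain string.
per_item = [sel.extract() for sel in li.xpath("a/text()")]
```

In this model, as in the tutorial, `sel.xpath(...)` inside the loop builds a new list of matches, which is why `.extract()` on it still gives a list.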

In your other example:

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            link = href.extract()
            print(link)

each href in the loop is a Selector object. Calling .extract() on it gives you a single Unicode string:

    $ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/"
    2016-02-26 12:11:36 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
    (...)
    In [1]: response.css("ul.directory.dir-col > li > a::attr('href')")
    Out[1]:
    [<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
     <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
     ...
     <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>]

So .css() on the response returns a SelectorList:

    In [2]: type(response.css("ul.directory.dir-col > li > a::attr('href')"))
    Out[2]: scrapy.selector.unified.SelectorList

Looping over this object gives you Selector instances:

    In [5]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
       ...:     print href
       ...:
    <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
    <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
    (...)
    <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>

And calling .extract() on each one gives you a single Unicode string:

    In [6]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
       ...:     print type(href.extract())
       ...:
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>

Note: .extract() on a Selector is incorrectly documented as returning a list of strings. I will open an issue on parsel (which is the same as the Scrapy selectors and is used under the hood in Scrapy 1.1+).
