.xpath().extract() and .css().extract() return a list because .xpath() and .css() return SelectorList objects.
See https://parsel.readthedocs.org/en/v1.0.1/usage.html#parsel.selector.SelectorList.extract
(SelectorList).extract():
Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.
.extract_first() is what you are looking for (it is poorly documented).
Taken from http://doc.scrapy.org/en/latest/topics/selectors.html :
If you want to extract only the first matched element, you can call the selector's .extract_first() method:
    >>> response.xpath('//div[@id="images"]/a/text()').extract_first()
    u'Name: My image 1 '
In your other example:
    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            link = href.extract()
            print(link)
each href in the loop is a Selector object. Calling .extract() on it gives you a single Unicode string:
    $ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/"
    2016-02-26 12:11:36 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
    (...)

    In [1]: response.css("ul.directory.dir-col > li > a::attr('href')")
    Out[1]:
    [<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
     <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
     ...
     <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>]
So .css() on the response returns a SelectorList:
    In [2]: type(response.css("ul.directory.dir-col > li > a::attr('href')"))
    Out[2]: scrapy.selector.unified.SelectorList
Looping over this object gives you Selector instances:
    In [5]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
       ...:     print href
       ...:
    <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
    <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
    (...)
    <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
And calling .extract() on each of them gives you a single Unicode string:
    In [6]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        print type(href.extract())
       ...:
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>
Note: .extract() on Selector is incorrectly documented as returning a list of strings. I'll open an issue on parsel (which is the same as the Scrapy selectors and is used under the hood in Scrapy 1.1+).