Scrapy crawls the first page, but does not follow links

I can't understand why Scrapy crawls the first page but does not follow the links to the subsequent pages. It must be something to do with the Rules. Any help is very much appreciated. Thanks!

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistItem


class MySpider(CrawlSpider):
    name = "craig"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/acc/"]
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=("index100\.html",),
                restrict_xpaths=('//p[@id="nextpage"]',),
            ),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for title in titles:
            item = CraigslistItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items


spider = MySpider()
1 answer

Craigslist uses index100, index200, index300, ... for the subsequent pages, up to a maximum of index900, so a rule that only allows index100.html never matches the later pagination links. Broadening the pattern to

rules = (
    Rule(
        SgmlLinkExtractor(
            allow=("index\d00\.html",),
            restrict_xpaths=('//p[@id="nextpage"]',),
        ),
        callback="parse_items",
        follow=True,
    ),
)
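A quick sanity check of the allow pattern with the plain re module (the filenames are just illustrative examples of the pagination scheme described above) shows why the broadened regex follows every page while the original one stops after the first:

import re

pages = ["index100.html", "index200.html", "index500.html", "index900.html"]

old_pattern = re.compile(r"index100\.html")   # original rule: matches only the first page
new_pattern = re.compile(r"index\d00\.html")  # broadened rule: matches every pagination page

for page in pages:
    print(page, bool(old_pattern.search(page)), bool(new_pattern.search(page)))

# index100.html True True
# index200.html False True
# index500.html False True
# index900.html False True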

works for me.
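For reference, scrapy.contrib and SgmlLinkExtractor were removed in later Scrapy releases. A rough sketch of the same spider against the current API (untested here, and the Craigslist markup may have changed since, so the XPaths and the project item CraigslistItem are carried over from the question as assumptions):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from craigslist_sample.items import CraigslistItem


class MySpider(CrawlSpider):
    name = "craig"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/acc/"]

    # Follow every pagination link (index100.html ... index900.html)
    # and run parse_items on each page that is fetched.
    rules = (
        Rule(
            LinkExtractor(
                allow=(r"index\d00\.html",),
                restrict_xpaths=('//p[@id="nextpage"]',),
            ),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        # response.xpath() replaces the old HtmlXPathSelector wrapper.
        for row in response.xpath("//p"):
            item = CraigslistItem()
            item["title"] = row.xpath("a/text()").extract()
            item["link"] = row.xpath("a/@href").extract()
            yield item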

