Scrapy start_urls

The script below, from this tutorial, contains two URLs in start_urls.

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)

        return items

But why does it only scrape these two web pages? I see allowed_domains = ["dmoz.org"], but these two pages contain links to other pages within the dmoz.org domain. Why doesn't it scrape those too?

6 answers

The start_urls class attribute contains the start URLs, nothing more. If you have extracted URLs of other pages you want to scrape, yield the corresponding requests from the parse callback, with [another] callback:

import urlparse

from scrapy import log
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Spider(BaseSpider):
    name = 'my_spider'
    start_urls = ['http://www.domain.com/']
    allowed_domains = ['domain.com']

    def parse(self, response):
        '''Parse main page and extract categories links.'''
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//*[@id='tSubmenuContent']/a[position()>1]/@href").extract()
        for url in urls:
            url = urlparse.urljoin(response.url, url)
            self.log('Found category url: %s' % url)
            yield Request(url, callback=self.parseCategory)

    def parseCategory(self, response):
        '''Parse category page and extract links of the items.'''
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//*[@id='_list']//td[@class='tListDesc']/a/@href").extract()
        for link in links:
            itemLink = urlparse.urljoin(response.url, link)
            self.log('Found item link: %s' % itemLink, log.DEBUG)
            yield Request(itemLink, callback=self.parseItem)

    def parseItem(self, response):
        ...

If you want to customize how the initial requests are created, override the BaseSpider.start_requests() method.
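
For example, a minimal sketch of overriding start_requests(), using the same old BaseSpider/Request API as the answers here; the spider name and URLs are placeholders, not taken from the question:

from scrapy.http import Request
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = 'my_spider'  # placeholder name
    allowed_domains = ['domain.com']

    def start_requests(self):
        # Build the initial requests yourself instead of listing start_urls;
        # each request can get its own callback, headers, cookies, etc.
        urls = [
            'http://www.domain.com/page1',  # placeholder URLs
            'http://www.domain.com/page2',
        ]
        for url in urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Handle the downloaded page here.
        pass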

start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for it. Take a look at http://doc.scrapy.org/en/latest/topics/spiders.html for an example.
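
As a rough sketch of that approach (same old-style CrawlSpider API as elsewhere on this page; the spider name, start URL, and log message are placeholders):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class RecursiveSpider(CrawlSpider):
    name = 'recursive_spider'  # placeholder name
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/']

    # Follow every link that stays inside allowed_domains; each followed page
    # is passed to parse_item, and follow=True keeps the crawl recursive.
    rules = [
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Extract whatever you need from each crawled page here.
        self.log('Visited %s' % response.url)
        return []

Note that with CrawlSpider you should not override parse() itself, because CrawlSpider uses it internally to apply the rules.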

The class does not have the rules property. Take a look at http://readthedocs.org/docs/scrapy/en/latest/intro/overview.html and search for "rules" to find an example.

If you use BaseSpider, you must extract the URLs you need yourself inside the callback and return Request objects for them.

If you use CrawlSpider, link extraction is handled by the rules and the SgmlLinkExtractor associated with them.

If you use rules to follow links (which is already implemented in Scrapy), the spider will scrape them as well. I hope that helps...

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class Spider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['http://www.domain.com/']
    allowed_domains = ['domain.com']

    # rules only work with CrawlSpider (not BaseSpider); follow=True makes
    # the spider keep following the links extracted from every crawled page.
    rules = [Rule(SgmlLinkExtractor(allow=[], deny=[]), follow=True)]

    ...
You did not write a function to process the URLs that you want to follow. There are two ways to resolve this: 1. use rules (CrawlSpider), or 2. write a function to handle the new URLs and pass it as the callback of the requests you yield.
