Scrapy start_urls

The script below, from this tutorial, contains two URLs in start_urls.

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)

        return items

But why does it only scrape these two web pages? I see allowed_domains = ["dmoz.org"], but these two pages contain links to other pages within the dmoz.org domain. Why doesn't it scrape those too?

6 answers

The start_urls class attribute contains the start URLs, nothing more. If you have extracted URLs of other pages you want to scrape, yield the corresponding requests from the parse callback, with [another] callback:

import urlparse

from scrapy import log
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Spider(BaseSpider):
    name = 'my_spider'
    start_urls = ['http://www.domain.com/']
    allowed_domains = ['domain.com']

    def parse(self, response):
        '''Parse main page and extract categories links.'''
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//*[@id='tSubmenuContent']/a[position()>1]/@href").extract()
        for url in urls:
            url = urlparse.urljoin(response.url, url)
            self.log('Found category url: %s' % url)
            yield Request(url, callback=self.parseCategory)

    def parseCategory(self, response):
        '''Parse category page and extract links of the items.'''
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//*[@id='_list']//td[@class='tListDesc']/a/@href").extract()
        for link in links:
            itemLink = urlparse.urljoin(response.url, link)
            self.log('Found item link: %s' % itemLink, log.DEBUG)
            yield Request(itemLink, callback=self.parseItem)

    def parseItem(self, response):
        ...

If you want to customize how the initial requests are created, override the BaseSpider.start_requests() method.
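
For example, a minimal sketch of overriding start_requests(), using the same old BaseSpider/Request API as the answers here; the spider name and URLs are placeholders, not taken from the question:

from scrapy.http import Request
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = 'my_spider'  # placeholder name
    allowed_domains = ['domain.com']

    def start_requests(self):
        # Build the initial requests yourself instead of listing start_urls;
        # each request can get its own callback, headers, cookies, etc.
        urls = [
            'http://www.domain.com/page1',  # placeholder URLs
            'http://www.domain.com/page2',
        ]
        for url in urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Handle the downloaded page here.
        pass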

start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for it. Take a look at http://doc.scrapy.org/en/latest/topics/spiders.html for an example.
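
As a rough sketch of that approach (same old-style CrawlSpider API as elsewhere on this page; the spider name, start URL, and log message are placeholders):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class RecursiveSpider(CrawlSpider):
    name = 'recursive_spider'  # placeholder name
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/']

    # Follow every link that stays inside allowed_domains; each followed page
    # is passed to parse_item, and follow=True keeps the crawl recursive.
    rules = [
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Extract whatever you need from each crawled page here.
        self.log('Visited %s' % response.url)
        return []

Note that with CrawlSpider you should not override parse() itself, because CrawlSpider uses it internally to apply the rules.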

The class does not have the rules property. Take a look at http://readthedocs.org/docs/scrapy/en/latest/intro/overview.html and search for "rules" to find an example.

If you use BaseSpider, you must extract the URLs you need yourself inside the callback and return Request objects for them.

If you use CrawlSpider, link extraction is handled by the rules and the SgmlLinkExtractor associated with them.

If you use rules to follow links (which is already implemented in Scrapy), the spider will scrape them as well. I hope that helps...

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class Spider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['http://www.domain.com/']
    allowed_domains = ['domain.com']

    # rules only work with CrawlSpider (not BaseSpider); follow=True makes
    # the spider keep following the links extracted from every crawled page.
    rules = [Rule(SgmlLinkExtractor(allow=[], deny=[]), follow=True)]

    ...
You did not write a function to process the URLs that you want to follow. There are two ways to resolve this: 1. use rules (CrawlSpider), or 2. write a function to handle the new URLs and pass it as the callback of the requests you yield.
