Scrapy set depth limit per allowed_domains

Question

I am crawling 6 different allowed_domains and would like to limit the depth for one of them. How can I limit the depth for just that one domain in Scrapy? Or is it possible to crawl only one level deep on external domains?

python web-crawler web-scraping scrapy

E liquid vape Jan 6 '15 at 19:40

1 answer

alecxe · Accepted Answer · 2015-01-07T02:21:25+0000

Scrapy does not provide anything like this out of the box. You can set DEPTH_LIMIT per spider, but not per domain.

What can we do? Read the code, drink coffee and solve it (order is important).

The idea is to disable Scrapy's built-in DepthMiddleware and provide our own custom one instead.

First define the settings:

DOMAIN_DEPTHS: a dictionary mapping each domain to its depth limit.
DEPTH_LIMIT: left as the default limit, used when a domain is not configured.

Settings example:

    DOMAIN_DEPTHS = {'amazon.com': 1, 'homedepot.com': 4}
    DEPTH_LIMIT = 3

OK, now the custom middleware (based on DepthMiddleware):

    from scrapy import log
    from scrapy.http import Request
    import tldextract


    class DomainDepthMiddleware(object):
        def __init__(self, domain_depths, default_depth):
            self.domain_depths = domain_depths
            self.default_depth = default_depth

        @classmethod
        def from_crawler(cls, crawler):
            settings = crawler.settings
            domain_depths = settings.getdict('DOMAIN_DEPTHS', default={})
            default_depth = settings.getint('DEPTH_LIMIT', 1)

            return cls(domain_depths, default_depth)

        def process_spider_output(self, response, result, spider):
            def _filter(request):
                if isinstance(request, Request):
                    # get max depth per domain
                    domain = tldextract.extract(request.url).registered_domain
                    maxdepth = self.domain_depths.get(domain, self.default_depth)

                    depth = response.meta.get('depth', 0) + 1
                    request.meta['depth'] = depth

                    if maxdepth and depth > maxdepth:
                        log.msg(format="Ignoring link (depth > %(maxdepth)d): %(requrl)s ",
                                level=log.DEBUG, spider=spider,
                                maxdepth=maxdepth, requrl=request.url)
                        return False
                return True

            return (r for r in result or () if _filter(r))

Note that the middleware requires the tldextract module (used here to extract the registered domain name from a URL):

    >>> import tldextract
    >>> url = 'http://stackoverflow.com/questions/27805952/scrapy-set-depth-limit-per-allowed-domains'
    >>> tldextract.extract(url).registered_domain
    'stackoverflow.com'
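The per-domain rule inside _filter can also be tried on its own, without Scrapy or tldextract installed. The sketch below is only an illustration: should_follow and naive_registered_domain are hypothetical helper names, and the domain extraction is a rough stand-in that keeps the last two host labels (so it gets suffixes like co.uk wrong, which is why the middleware uses tldextract).

```python
from urllib.parse import urlparse

# Same shape as the example settings in the answer.
DOMAIN_DEPTHS = {'amazon.com': 1, 'homedepot.com': 4}
DEPTH_LIMIT = 3


def naive_registered_domain(url):
    """Hypothetical stand-in for tldextract: keep the last two host labels.

    Assumption: this is wrong for suffixes such as 'co.uk'; real code
    should use tldextract as the middleware does.
    """
    host = urlparse(url).hostname or ''
    return '.'.join(host.split('.')[-2:])


def should_follow(url, parent_depth,
                  domain_depths=DOMAIN_DEPTHS, default_depth=DEPTH_LIMIT):
    """Return True if a link found on a page at parent_depth may be followed."""
    # Per-domain limit, falling back to the default for unlisted domains.
    maxdepth = domain_depths.get(naive_registered_domain(url), default_depth)
    depth = parent_depth + 1
    # A limit of 0 means "no limit", mirroring `if maxdepth and depth > maxdepth`.
    return not (maxdepth and depth > maxdepth)
```

With the example settings, a link to amazon.com found on a start page (depth 0) is followed, while the same link found one level deeper is dropped. Domains that are not listed fall back to DEPTH_LIMIT.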

Now we need to disable the default middleware and enable the one we implemented:

    SPIDER_MIDDLEWARES = {
        'myproject.middlewares.DomainDepthMiddleware': 900,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': None
    }
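On Scrapy 1.0 and later the built-in middleware lives under scrapy.spidermiddlewares rather than scrapy.contrib, so the dictionary key changes accordingly. A sketch of the same settings for those versions, assuming the same placeholder project name myproject:

```python
# settings.py (sketch for Scrapy 1.0 or later)
# 'myproject' is a placeholder project name, as in the settings above.
SPIDER_MIDDLEWARES = {
    # Enable the custom per-domain depth middleware.
    'myproject.middlewares.DomainDepthMiddleware': 900,
    # Disable the built-in depth middleware under its post-1.0 import path.
    'scrapy.spidermiddlewares.depth.DepthMiddleware': None,
}
```

The middleware itself imports scrapy.log, which those versions replaced with Python's standard logging module, so the log.msg call would need a matching change there.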
