Scrapy download_delay vs max_concurrent_requests_per_domain

I am very confused by the difference and interaction between DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN in Scrapy.

Does the download delay affect the maximum number of simultaneous requests per domain? For example, if I set the delay to 10 seconds but allow 8 concurrent requests per domain, will those requests be fired not simultaneously but staggered by the download delay, or will they be fired simultaneously with the downloading of the responses staggered? Is there any reason DOWNLOAD_DELAY is not called REQUEST_DELAY?

For example, what would the back-of-the-envelope throughput calculation be in the following scenario:

  • start_urls contains 100 URLs for this domain
  • CONCURRENT_REQUESTS_PER_DOMAIN = 8
  • DOWNLOAD_DELAY = 3
  • Suppose a server needs 2 seconds to generate a response
  • suppose we don’t generate any URLs beyond those already in start_urls

How long will it take the spider to process this queue?
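
For concreteness, here is a minimal sketch of a spider configured for this scenario (the spider name and URL pattern are placeholders, not part of the original question):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"  # placeholder name
    # 100 URLs on a single domain (placeholder pattern)
    start_urls = [f"https://example.com/page/{i}" for i in range(100)]

    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DOWNLOAD_DELAY": 3,  # seconds
    }

    def parse(self, response):
        # generate no further requests, per the scenario above
        yield {"url": response.url}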

1 answer

From the downloader source code:

# if per-IP concurrency is set (non-zero), it takes precedence over
# per-domain concurrency when choosing the limit and delay for a slot
conc = self.ip_concurrency if self.ip_concurrency else self.domain_concurrency
conc, delay = _get_concurrency_delay(conc, spider, self.settings)

So it looks like the behavior will be the same as what the CONCURRENT_REQUESTS_PER_IP documentation says:

This setting also affects DOWNLOAD_DELAY: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.
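
In other words, a non-zero CONCURRENT_REQUESTS_PER_IP replaces the per-domain limit entirely, and the delay is then applied per IP as well. Continuing the sketch above, inside the spider class (values are illustrative):

custom_settings = {
    # non-zero per-IP limit: CONCURRENT_REQUESTS_PER_DOMAIN is ignored,
    # and DOWNLOAD_DELAY is enforced per remote IP instead of per domain
    "CONCURRENT_REQUESTS_PER_IP": 4,
    "DOWNLOAD_DELAY": 3,
}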

So I do not think you will achieve much concurrency with a large DOWNLOAD_DELAY. I have run crawlers on a slow network with autothrottling, and there were never more than 2-3 concurrent requests at a time.
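
For the scenario in the question, a rough back-of-the-envelope estimate (assuming, per the source above, that DOWNLOAD_DELAY is enforced between consecutive requests to the same slot):

n_urls = 100
download_delay = 3.0  # seconds between consecutive requests to the slot
response_time = 2.0   # server-side time to generate a response

# Since response_time < download_delay, requests are effectively
# serialized: one leaves every 3 s, and the 8-request concurrency
# cap is never reached.
last_request_at = (n_urls - 1) * download_delay  # 297 s
total_time = last_request_at + response_time     # ~299 s, about 5 minutes

Note that with the default RANDOMIZE_DOWNLOAD_DELAY = True, the actual wait is a random value between 0.5 × DOWNLOAD_DELAY and 1.5 × DOWNLOAD_DELAY, so ~300 s is only the expected total.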


Source: https://habr.com/ru/post/1569654/

