Scrapy: exponential backoff to restart the crawl after each exception

I would like to have a Scrapy spider that normally runs to completion, but can stop and later resume mid-crawl whenever certain exceptions are encountered.

Whenever such an error occurs, the callback looks something like this:

def parse_inner_page(self, response):
    if "Sorry, we just need to make sure you're not a robot" in response.text:
        # here I want to stop the spider, have it remember the state of
        # this request, and retry after 2 seconds; if that fails again,
        # wait 2^2 seconds, then 2^3 seconds, and so on...
        pass

Therefore, this requires (a rough sketch follows the list below):

  • Storing and updating some external state n (for the 2^n seconds of waiting)
  • Ability to interrupt the spider from within the script (see above)
  • The ability to restart the spider, re-queue all requests/responses that failed, and pick them up again from the same place (depth, meta and all).
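To make this concrete, here is a rough sketch of the spider-side half, assuming the crawl state is persisted with Scrapy's JOBDIR (as in the script further down). The file name backoff.json, the spider layout and the idea of passing the counter through a file are my own assumptions, not anything Scrapy provides out of the box: when the captcha page is detected, the counter n is bumped on disk, the failed request is put back into the queue with dont_filter=True, and the spider asks to be closed gracefully.

import json
import os

import scrapy
from scrapy.exceptions import CloseSpider

BACKOFF_FILE = 'backoff.json'   # hypothetical file holding the external state n


class MySpider(scrapy.Spider):
    name = 'myspider'
    # start_urls / start_requests and the rest of the spider are omitted

    def parse_inner_page(self, response):
        if "Sorry, we just need to make sure you're not a robot" in response.text:
            # bump the external counter n; the outer script waits 2 ** n seconds
            n = 0
            if os.path.exists(BACKOFF_FILE):
                with open(BACKOFF_FILE) as f:
                    n = json.load(f).get('n', 0)
            with open(BACKOFF_FILE, 'w') as f:
                json.dump({'n': n + 1}, f)

            # re-queue the failed request; dont_filter=True keeps the dupefilter
            # from dropping it as already seen
            yield response.request.replace(dont_filter=True)

            # ask Scrapy to shut the spider down gracefully; with JOBDIR set,
            # pending requests (including the one re-queued above) should be
            # persisted to disk for the next run
            raise CloseSpider('captcha')

        # ... normal parsing of the page goes here ...

Whether the re-queued request is written to the JOBDIR queue before the shutdown completes is worth verifying; a more defensive variant would record the failed URL in the same state file and re-issue it from start_requests on the next run.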

Is this possible while scraping, and what would be the best way to code it?

So far I have found Scrapy's jobs feature and running the spider from a Python script:

import scrapy
from scrapy.crawler import CrawlerProcess

from myproject.spiders.mymodule import MySpider

# https://doc.scrapy.org/en/latest/topics/jobs.html
# JOBDIR makes Scrapy persist the scheduler queue and dupefilter state to disk,
# so an interrupted crawl can be resumed later

process = CrawlerProcess({'JOBDIR': './job'})
process.crawl(MySpider)
process.start()

But how do I then stop the process partway through? The only thing I can think of is the equivalent of Ctrl + C, i.e. sending the process SIGINT:

import subprocess
import signal

# run the crawl in a child process, then interrupt it the same way Ctrl + C would
process = subprocess.Popen(...)
process.send_signal(signal.SIGINT)
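Tying the pieces together, here is a minimal driver-loop sketch. It assumes the spider stops itself (e.g. by raising CloseSpider as sketched above) rather than being killed with SIGINT, that it increments n in backoff.json every time it hits the captcha page, and that each run uses the same JOBDIR so the persisted queue is resumed; the spider name myspider and the file name are placeholders. Each run has to be a fresh process in any case, because the Twisted reactor cannot be restarted within a single process.

import json
import os
import subprocess
import time

BACKOFF_FILE = 'backoff.json'   # incremented by the spider on every captcha hit
JOBDIR = './job'


def read_n():
    if not os.path.exists(BACKOFF_FILE):
        return 0
    with open(BACKOFF_FILE) as f:
        return json.load(f).get('n', 0)


while True:
    n_before = read_n()

    # each run is a fresh process; the same JOBDIR resumes the persisted queue
    subprocess.run(['scrapy', 'crawl', 'myspider', '-s', f'JOBDIR={JOBDIR}'])

    n_after = read_n()
    if n_after == n_before:
        break  # this run finished without hitting the captcha page

    delay = 2 ** n_after
    print(f'blocked, waiting {delay} seconds before resuming')
    time.sleep(delay)

In this sketch n only ever grows; resetting it after a run that makes progress (for example in the spider's closed() method when the close reason is 'finished') is left out for brevity.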

Source: https://habr.com/ru/post/1667592/

