I would like to have a Scrapy spider that normally runs to completion, but can stop in the middle when an exception condition is hit and later resume from that point.
Every time the error occurs, the detection in the callback looks something like this:
def parse_inner_page(self, response):
    if "Sorry, we just need to make sure you're not a robot" in response.text:
        pass  # <- this is where the crawl should pause/stop
This therefore requires:
- Storing and updating some external state (e.g. a failure counter n, so the wrapper can wait 2^n seconds before retrying)
- The ability to interrupt the spider from within the script when the check above is hit (sketched right after this list)
- The ability to restart the spider, feed the failed requests back in, and return them to the same place in the crawl (depth, meta and all)
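For the "interrupt from within" part, this is roughly what I have in mind. It is only a sketch: save_blocked_url() is a hypothetical helper that would append the URL to whatever external state file the wrapper script reads, while CloseSpider is Scrapy's built-in exception for requesting a graceful shutdown from inside a callback.

from scrapy.exceptions import CloseSpider

def parse_inner_page(self, response):
    if "Sorry, we just need to make sure you're not a robot" in response.text:
        self.save_blocked_url(response.url)  # hypothetical helper: remember where we got blocked
        raise CloseSpider("captcha")         # ask Scrapy to close the spider gracefully
    # ... normal parsing of the page would go here ...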
Is this possible in Scrapy, and what would be the best way to code it?
What I have found so far: Scrapy jobs (JOBDIR), run from a Python script:
import scrapy
from scrapy.crawler import CrawlerProcess
from myproject.spiders.mymodule import MySpider

# JOBDIR makes Scrapy persist the crawl state (request queue, dupefilter) between runs
process = CrawlerProcess({'JOBDIR': './job'})
process.crawl(MySpider)
process.start()
but how do I interrupt process while it is running, and how do I start it again afterwards? By sending it the equivalent of Ctrl+C from outside, something like this?
import subprocess
import signal

process = subprocess.Popen(...)
process.send_signal(signal.SIGINT)  # SIGINT is what Ctrl+C sends; Scrapy shuts down gracefully on the first one
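Putting it together, this is the kind of wrapper loop I am imagining. It is a minimal sketch with assumptions: the spider is started through the scrapy crawl command line, it is called myspider, and before closing itself it writes a hypothetical captcha.flag marker file whenever it hits the robot check (the spider name and the flag file are placeholders, not real project details).

import os
import subprocess
import time

JOBDIR = "./job"       # reuse the same JOBDIR so pending requests persist across runs
FLAG = "captcha.flag"  # hypothetical marker the spider writes before shutting down

n = 0
while True:
    # each run resumes from the request queue serialized in JOBDIR
    subprocess.run(["scrapy", "crawl", "myspider", "-s", f"JOBDIR={JOBDIR}"])

    if not os.path.exists(FLAG):
        break           # no marker left behind, so the crawl really finished

    os.remove(FLAG)     # clear the marker and back off before retrying
    n += 1
    time.sleep(2 ** n)  # the 2^n seconds of waiting mentioned above

As far as I understand the docs, the JOBDIR setting is what lets each new run pick up the request queue serialized by the previous one. Would something along these lines work, or is there a cleaner way to do it?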