Can someone explain to me how the pause/resume function works in Scrapy?
The version of Scrapy I am using is 0.24.5.
The documentation does not go into much detail.
I have the following simple spider:
    from scrapy import Request, Spider


    class SampleSpider(Spider):
        name = 'sample'

        def start_requests(self):
            yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1053')
            yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1054')
            yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1055')

        def parse(self, response):
            # Append each crawled URL to a file so I can see which requests were processed.
            with open('responses.txt', 'a') as f:
                f.write(response.url + '\n')
I run it using:
    from twisted.internet import reactor

    from scrapy.crawler import Crawler
    from scrapy import log, signals
    from scrapy.utils.project import get_project_settings

    from scrapyproject.spiders.sample_spider import SampleSpider


    spider = SampleSpider()

    settings = get_project_settings()
    # Persist the crawl state so the crawl can be paused and resumed.
    settings.set('JOBDIR', '/some/path/scrapy_cache')
    # Slow the crawl down so it can be interrupted before all requests are processed.
    settings.set('DOWNLOAD_DELAY', 10)

    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

    log.start()
    reactor.run()
As you can see, I enabled the JOBDIR setting so that the state of my crawl is saved.
I set DOWNLOAD_DELAY to 10 seconds so that I could stop the spider before all the requests were processed. I expected that the next time I run the spider, the already-processed requests would not be regenerated. That is not the case.
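For reference, my understanding is that the same persistence can also be enabled without a custom runner script, either in the project's settings.py or on the command line (a sketch only, assuming the default project layout; the path is just a placeholder):

    # settings.py of the Scrapy project (default project layout assumed).
    # Equivalent to the settings.set(...) calls in the runner script above;
    # the same can be done from the shell with:
    #   scrapy crawl sample -s JOBDIR=/some/path/scrapy_cache
    JOBDIR = '/some/path/scrapy_cache'
    DOWNLOAD_DELAY = 10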
In the scrapy_cache folder, I see a folder called requests.queue; however, it is always empty.
It looks like the requests.seen file does store the fingerprints of the issued requests (as SHA1 hashes), which is great. However, the next time I launch the spider, the requests are regenerated and duplicate SHA1 hashes are appended to the file. I traced this issue in the Scrapy code, and it looks like RFPDupeFilter opens the requests.seen file with the 'a+' flag. Because of this, it always ignores the values previously stored in the file (at least that is the behavior on my Mac OS X).
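To illustrate what I mean (a small standalone sketch, not Scrapy code; fingerprints.txt is just a throwaway file name), a file opened with 'a+' can have its read position at the end of the file, so the existing lines are never read unless you seek(0) first:

    # Demonstrates the 'a+' read-position behavior I suspect is affecting
    # RFPDupeFilter; 'fingerprints.txt' is just a throwaway example file.
    with open('fingerprints.txt', 'w') as f:
        f.write('aaaa\n')
        f.write('bbbb\n')

    f = open('fingerprints.txt', 'a+')
    print(list(f))   # on my machine this prints [] - the read pointer starts at EOF
    f.seek(0)
    print(list(f))   # ['aaaa\n', 'bbbb\n'] once the file is explicitly rewound
    f.close()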
Finally, with regard to the spider state, I can see from the Scrapy code that the spider's state is saved when the spider is closed and read back when it is opened. However, this is not very useful if an exception occurs (for example, if the machine shuts down). Do I have to save the state periodically myself?
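For context, this is how I understand the documented way of keeping persistent state between runs (a sketch only; StatefulSampleSpider and the pages_seen key are names I made up for illustration):

    from scrapy import Request, Spider


    class StatefulSampleSpider(Spider):
        """Variant of the spider above that keeps a counter in self.state.

        With JOBDIR enabled, Scrapy pickles self.state when the spider closes
        cleanly and restores it on the next run; 'pages_seen' is just an
        arbitrary example key.
        """
        name = 'stateful_sample'

        def start_requests(self):
            yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1053')

        def parse(self, response):
            # self.state is provided by the built-in SpiderState middleware.
            self.state['pages_seen'] = self.state.get('pages_seen', 0) + 1
            with open('responses.txt', 'a') as f:
                f.write(response.url + '\n')

But as far as I can tell, this still only helps when the spider shuts down cleanly.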
The main question I have is this: what is the common practice for using Scrapy when you expect the crawl to stop and resume several times (for example, when crawling a very large website)?