Can someone explain to me how the pause/resume function works in Scrapy?
The version of Scrapy I am using is 0.24.5.
The documentation does not go into much detail.
I have the following simple spider:
    from scrapy import Request, Spider


    class SampleSpider(Spider):
        name = 'sample'

        def start_requests(self):
            yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1053')
            yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1054')
            yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1055')

        def parse(self, response):
            # Append each crawled URL to a file so I can see which requests were processed.
            with open('responses.txt', 'a') as f:
                f.write(response.url + '\n')
I run it using:
    from twisted.internet import reactor

    from scrapy.crawler import Crawler
    from scrapy import log, signals
    from scrapy.utils.project import get_project_settings

    from scrapyproject.spiders.sample_spider import SampleSpider


    spider = SampleSpider()

    settings = get_project_settings()
    # Persist the crawl state so the crawl can be paused and resumed.
    settings.set('JOBDIR', '/some/path/scrapy_cache')
    # Slow the crawl down so it can be interrupted before all requests are processed.
    settings.set('DOWNLOAD_DELAY', 10)

    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

    log.start()
    reactor.run()
As you can see, I enabled the JOBDIR setting so that the state of my crawl is saved.
I set DOWNLOAD_DELAY to 10 seconds so that I could stop the spider before all the requests were processed. I expected that the next time I run the spider, the already-processed requests would not be regenerated. That is not the case.
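For reference, my understanding is that the same persistence can also be enabled without a custom runner script, either in the project's settings.py or on the command line (a sketch only, assuming the default project layout; the path is just a placeholder):

    # settings.py of the Scrapy project (default project layout assumed).
    # Equivalent to the settings.set(...) calls in the runner script above;
    # the same can be done from the shell with:
    #   scrapy crawl sample -s JOBDIR=/some/path/scrapy_cache
    JOBDIR = '/some/path/scrapy_cache'
    DOWNLOAD_DELAY = 10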
In the scrapy_cache folder, I see a folder called requests.queue; however, it is always empty.
It looks like the requests.seen file does store the fingerprints of the issued requests (as SHA1 hashes), which is great. However, the next time I launch the spider, the requests are regenerated and duplicate SHA1 hashes are appended to the file. I traced this issue in the Scrapy code, and it looks like RFPDupeFilter opens the requests.seen file with the 'a+' flag. Because of this, it always ignores the values previously stored in the file (at least that is the behavior on my Mac OS X).
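To illustrate what I mean (a small standalone sketch, not Scrapy code; fingerprints.txt is just a throwaway file name), a file opened with 'a+' can have its read position at the end of the file, so the existing lines are never read unless you seek(0) first:

    # Demonstrates the 'a+' read-position behavior I suspect is affecting
    # RFPDupeFilter; 'fingerprints.txt' is just a throwaway example file.
    with open('fingerprints.txt', 'w') as f:
        f.write('aaaa\n')
        f.write('bbbb\n')

    f = open('fingerprints.txt', 'a+')
    print(list(f))   # on my machine this prints [] - the read pointer starts at EOF
    f.seek(0)
    print(list(f))   # ['aaaa\n', 'bbbb\n'] once the file is explicitly rewound
    f.close()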
Finally, with regard to the spider state, I can see from the Scrapy code that the spider's state is saved when the spider is closed and read back when it is opened. However, this is not very useful if an exception occurs (for example, if the machine shuts down). Do I have to save the state periodically myself?
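For context, this is how I understand the documented way of keeping persistent state between runs (a sketch only; StatefulSampleSpider and the pages_seen key are names I made up for illustration):

    from scrapy import Request, Spider


    class StatefulSampleSpider(Spider):
        """Variant of the spider above that keeps a counter in self.state.

        With JOBDIR enabled, Scrapy pickles self.state when the spider closes
        cleanly and restores it on the next run; 'pages_seen' is just an
        arbitrary example key.
        """
        name = 'stateful_sample'

        def start_requests(self):
            yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1053')

        def parse(self, response):
            # self.state is provided by the built-in SpiderState middleware.
            self.state['pages_seen'] = self.state.get('pages_seen', 0) + 1
            with open('responses.txt', 'a') as f:
                f.write(response.url + '\n')

But as far as I can tell, this still only helps when the spider shuts down cleanly.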
The main question I have is this: what is the common practice for using Scrapy when you expect the crawl to stop and resume several times (for example, when crawling a very large website)?