I have a two-part question.
Firstly, I am writing a web scraper based on Scrapy's CrawlSpider. I want to scrape a website that contains thousands (possibly hundreds of thousands) of entries, buried 2-3 levels down from the start page. So basically the spider starts on a specific page, crawls until it finds the post type I'm after, and then parses the HTML. What methods exist to keep my spider from overloading the site? Is there a way to crawl in stages, or to pause between requests?
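For context, here is a rough sketch of what I have so far. The domain, URL patterns, and selectors are just placeholders, and I've added DOWNLOAD_DELAY and AutoThrottle after reading the docs, but I'm not sure whether that's the right (or complete) way to be polite to the site:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class EntrySpider(CrawlSpider):
    name = "entries"
    allowed_domains = ["example.com"]        # placeholder domain
    start_urls = ["https://example.com/"]    # the start page

    # Follow listing/category links down 2-3 levels; parse only entry pages.
    rules = (
        Rule(LinkExtractor(allow=r"/category/"), follow=True),
        Rule(LinkExtractor(allow=r"/entry/"), callback="parse_entry"),
    )

    # Per-spider politeness settings (these override settings.py).
    # Is this the right approach, or is there something better?
    custom_settings = {
        "DOWNLOAD_DELAY": 2,                  # wait between requests
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time
        "AUTOTHROTTLE_ENABLED": True,         # back off when the site is slow
        "ROBOTSTXT_OBEY": True,
    }

    def parse_entry(self, response):
        # Placeholder extraction logic.
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }
```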
Secondly, and related: is there a Scrapy mechanism for testing a crawler without putting excessive load on the site? I know I can kill the process while it's running, but is there a way to stop the spider cleanly after it has hit, say, the first page containing the information I want to scrape?
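The only thing I can think of so far is a throwaway subclass of my spider that raises CloseSpider after a handful of items, roughly like the sketch below (the item limit and class name are just for illustration). Is there a built-in way to do this instead?

```python
from scrapy.exceptions import CloseSpider

class EntrySpiderTestRun(EntrySpider):
    """Throwaway variant that stops after a few items while I debug parsing."""
    name = "entries_test"
    item_limit = 5
    _seen = 0

    def parse_entry(self, response):
        self._seen += 1
        if self._seen > self.item_limit:
            # Closes the spider cleanly instead of killing the process.
            raise CloseSpider("item limit reached for test run")
        yield from super().parse_entry(response)
        # (I also saw CLOSESPIDER_ITEMCOUNT in the settings docs --
        #  not sure whether that is the preferred way.)
```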
Any advice or resources are welcome.