Being a good citizen and web scraping

Question

I have a two-part question.

Firstly, I am writing a web scraper based on CrawlSpider in Scrapy. I am going to scrape a website that contains thousands (possibly hundreds of thousands) of entries. These entries are buried 2-3 layers down from the start page. So basically I have the spider start on a specific page, crawl until it finds a specific type of entry, and then parse the HTML. What methods exist to prevent my spider from overloading the site? Is there a way to do the crawl in stages, or to pause between requests?

Secondly, and related, is there a Scrapy method to test the crawler without putting undue stress on the site? I know that you can kill the program while it is running, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?

Any advice or resources are welcome.

+6

python scrapy screen-scraping

user1074057 Dec 17 '11 at 4:18

2 answers

You should start crawling and log everything. If you get blocked, you can add a sleep() before each page request.

Changing the User-Agent is also good practice (http://www.user-agents.org/, http://www.useragentstring.com/).

If your IP gets banned, use a proxy server to get around the ban. Cheers.

-2

Kirill Malev Dec 17 '11 at 5:42

reclosedev · Accepted Answer · 2011-12-17T06:40:15+0000

Is there a way to do something in stages

I use Scrapy's caching ability to scrape the site incrementally:

HTTPCACHE_ENABLED = True

Or you can use the Jobs feature, new in 0.14: pausing and resuming crawls.
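
As a rough sketch, both options can be switched on from the project's settings.py. HTTPCACHE_ENABLED is the setting quoted above, and JOBDIR is the setting the Jobs feature uses to persist crawl state. The directory name is only a placeholder, and HTTPCACHE_EXPIRATION_SECS is an extra that the answer does not mention.

```python
# settings.py (sketch; the JOBDIR path is a made-up placeholder)

# Store every response on disk so repeated runs are served from the
# local cache instead of hitting the site again.
HTTPCACHE_ENABLED = True

# Not in the original answer: 0 means cached responses never expire.
HTTPCACHE_EXPIRATION_SECS = 0

# Directory where the Jobs feature keeps crawl state between runs.
JOBDIR = "crawls/entries-run-1"
```

The same directory can instead be passed per run with -s JOBDIR=crawls/entries-run-1 on the scrapy crawl command line. Stopping with a single Ctrl-C lets Scrapy shut down cleanly, so a later run with the same JOBDIR resumes where the previous one stopped.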

or pause between different requests?

Check these settings:

DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY
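
A minimal sketch of how these two settings might look in settings.py. The 2-second base delay is an arbitrary example value, not a recommendation from the answer, and the two lowercase names at the end are plain illustration (Scrapy only reads uppercase names from the settings module).

```python
# settings.py (sketch; the 2-second base delay is an arbitrary example)

# Base number of seconds to wait between consecutive requests
# to the same site.
DOWNLOAD_DELAY = 2.0

# Wait a random time between 0.5x and 1.5x DOWNLOAD_DELAY instead of a
# fixed interval, so the request pattern is less regular.
RANDOMIZE_DOWNLOAD_DELAY = True

# Illustration only, not Scrapy settings: the resulting wait range.
min_wait = 0.5 * DOWNLOAD_DELAY  # 1.0 second
max_wait = 1.5 * DOWNLOAD_DELAY  # 3.0 seconds
```

A larger DOWNLOAD_DELAY is gentler on the site, at the cost of a proportionally longer crawl.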

Is there a Scrapy method to test the crawler without undue stress on the site?

You can try and debug your code in the Scrapy shell

I know that you can kill the program while it is running, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?

Alternatively, you can call scrapy.shell.inspect_response at any point in your spider to open the Scrapy shell on the response it has just received.

Any advice or resources are welcome.

The Scrapy documentation is the best resource.
