Scrapy does not crawl all pages

I am trying to crawl a site in a very simple way, but Scrapy does not crawl all of the links. Let me explain the scenario:

main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains a link to b_page.html
a2_page.html -> contains a link to c_page.html
b1_page.html -> contains a link to a_page.html
b2_page.html -> contains a link to c_page.html
c1_page.html -> contains a link to a_page.html
c2_page.html -> contains a link to main_page.html

I use the following rule in my CrawlSpider:

    Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
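
For context, here is a minimal sketch of how that rule is typically wired into a CrawlSpider. The spider name, allowed domain, and start URL below are assumed from the log output further down, and the imports are the Scrapy 0.14-era paths (SgmlLinkExtractor lived under scrapy.contrib back then):

    # Minimal sketch only -- spider name, domain, and start URL are assumed
    # from the log output below; imports are the Scrapy 0.14-era paths.
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class TestSpider(CrawlSpider):
        name = 'test_spider'
        allowed_domains = ['localhost']
        start_urls = ['http://localhost/main_page.html']

        rules = (
            # Follow every link and hand each response to parse_item
            Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # Placeholder callback: just log which page was crawled
            self.log('Crawled %s' % response.url)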

But the crawl results are as follows:

DEBUG: Crawled (200) <http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)

It does not crawl all pages.

NB: I set the crawl order to BFO (breadth-first order), as indicated in the Scrapy docs.

What am I missing?

+4
3 answers

I had a similar problem today, although I was using my own spider. It turned out that the site was limiting my crawl because my user agent was scrapy-bot.

Try changing your user agent and crawl again, for example to a known browser's user agent string.
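
For instance, something like this in settings.py (the browser string below is only an illustrative value):

    # settings.py -- override Scrapy's default user agent
    # (the browser string is only an example value)
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0'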

Another thing you can try is to add a delay. Some websites block scraping if the time between requests is too short. Try adding a DOWNLOAD_DELAY of 2 and see if that helps.
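
For example, also in settings.py:

    # settings.py -- wait 2 seconds between consecutive requests
    DOWNLOAD_DELAY = 2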

Additional information about DOWNLOAD_DELAY is at http://doc.scrapy.org/en/0.14/topics/settings.html

+3

Scrapy will filter all duplicate requests by default.

You can get around this by using, for example:

    yield Request(url="http://test.com", callback=self.callback, dont_filter=True)

dont_filter (boolean) - indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default is False.

See also the Request object documentation.
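
As a rough sketch of what that can look like in practice (illustrative only, not the CrawlSpider from the question; the spider name and XPath are assumptions, and the imports are the Scrapy 0.14-era paths):

    # Illustrative only: a hand-written spider that re-queues every link it
    # finds and bypasses the duplicate filter with dont_filter=True.
    from urlparse import urljoin

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    class ManualSpider(BaseSpider):
        name = 'manual_spider'
        start_urls = ['http://localhost/main_page.html']

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for href in hxs.select('//a/@href').extract():
                # dont_filter=True skips the scheduler's duplicate filter, so
                # already-seen pages are fetched again; with the cross-links in
                # the question this will loop unless you stop it yourself.
                yield Request(urljoin(response.url, href), callback=self.parse,
                              dont_filter=True)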

+5

Perhaps many of the URLs are duplicates. Scrapy avoids duplicate requests because re-crawling them is inefficient. From your explanation, since you follow every link with a catch-all rule, there are of course many duplicates: for example, b_page.html is linked from both main_page.html and a1_page.html, so only the first request for it is kept.

If you want to be sure and see the evidence in the log, add this to your settings.py:

 DUPEFILTER_DEBUG = True 

And you will see lines like this in the log:

2016-09-20 17:08:47 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.example.org/example.html>

0

Source: https://habr.com/ru/post/1384641/

