I am trying to crawl sites in a very simple way, but Scrapy does not crawl all the links. The scenario is as follows:
main_page.html → contains links to a_page.html, b_page.html, c_page.html
a_page.html → contains links to a1_page.html, a2_page.html
b_page.html → contains links to b1_page.html, b2_page.html
c_page.html → contains links to c1_page.html, c2_page.html
a1_page.html → contains a link to b_page.html
a2_page.html → contains a link to c_page.html
b1_page.html → contains a link to a_page.html
b2_page.html → contains a link to c_page.html
c1_page.html → contains a link to a_page.html
c2_page.html → contains a link to main_page.html
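For what it's worth, a plain breadth-first walk over this link graph (a quick sketch, not Scrapy code; the short page names stand for the .html files above) reaches every page, which is what I expected the spider to do:

```python
from collections import deque

# Link graph from the scenario above (page -> pages it links to).
links = {
    "main_page": ["a_page", "b_page", "c_page"],
    "a_page": ["a1_page", "a2_page"],
    "b_page": ["b1_page", "b2_page"],
    "c_page": ["c1_page", "c2_page"],
    "a1_page": ["b_page"],
    "a2_page": ["c_page"],
    "b1_page": ["a_page"],
    "b2_page": ["c_page"],
    "c1_page": ["a_page"],
    "c2_page": ["main_page"],
}

def bfs_crawl(start):
    """Breadth-first traversal with duplicate filtering,
    i.e. each page is fetched at most once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

print(bfs_crawl("main_page"))
```

So all 10 pages are reachable from main_page.html even with duplicate filtering.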
I use the following rule in CrawlSpider -
Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
But the scan results are as follows:
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)
It does not crawl all pages.
NB: I crawled in BFO (breadth-first order), as described in the Scrapy docs.
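For reference, this is roughly what I put in settings.py to get breadth-first order; the queue module paths follow the current Scrapy docs and may differ in older versions:

```python
# settings.py -- crawl in BFO by giving shallower requests priority
# and switching the scheduler queues from LIFO to FIFO.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```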
What am I missing?