Scrapy follows and scrapes non-allowed links

I have a CrawlSpider set up to crawl a news magazine site, where the links to each issue follow this URL pattern:

http://example.com/YYYY/DDDD/index.htm, where YYYY is the year and DDDD is the issue number, three or four digits long.

I only want issues 928 and later, and my rules are below. I have no problem connecting to the site, crawling links, or extracting items (so I did not include the rest of my code). The spider nevertheless follows links it should not: it tries to scrape issues 377, 398 and others, and follows the "culture.htm" and "feature.htm" links. This causes a lot of errors and, while not critical, requires a lot of data cleaning. Any suggestions as to what is going wrong?

    class crawlerNameSpider(CrawlSpider):
        name = 'crawler'
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/issues.htm"]

        rules = (
            Rule(SgmlLinkExtractor(allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', )), follow = True),
            Rule(SgmlLinkExtractor(allow = ('fr[0-9].htm', )), callback = 'parse_item'),
            Rule(SgmlLinkExtractor(allow = ('eg[0-9]*.htm', )), callback = 'parse_item'),
            Rule(SgmlLinkExtractor(allow = ('ec[0-9]*.htm', )), callback = 'parse_item'),
            Rule(SgmlLinkExtractor(allow = ('op[0-9]*.htm', )), callback = 'parse_item'),
            Rule(SgmlLinkExtractor(allow = ('sc[0-9]*.htm', )), callback = 'parse_item'),
            Rule(SgmlLinkExtractor(allow = ('re[0-9]*.htm', )), callback = 'parse_item'),
            Rule(SgmlLinkExtractor(allow = ('in[0-9]*.htm', )), callback = 'parse_item'),
            Rule(SgmlLinkExtractor(deny = ('culture.htm', ))),
            Rule(SgmlLinkExtractor(deny = ('feature.htm', ))),
        )
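As a sanity check, the allow pattern in the first rule can be tested on its own with Python's re module (the URLs below are made-up examples following the scheme above; link extractors apply the pattern unanchored, as re.search does):

```python
import re

# The alternation (92[8-9]|9[3-9][0-9]|\d\d\d\d) is meant to match
# issues 928-929, 930-999, or any four-digit issue number.
issue_pat = re.compile(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm')

print(bool(issue_pat.search("http://example.com/2011/928/index.htm")))  # True
print(bool(issue_pat.search("http://example.com/2011/377/index.htm")))  # False
```

On its own the pattern correctly rejects issue 377, which suggests the stray links are being admitted by one of the other rules rather than this one.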

EDIT: I fixed it using a much simpler regular expression that just matches the years 2009, 2010, 2011, but I am still wondering why the above does not work, if anyone has any suggestions.

1 answer

You need to pass the deny arguments to the same SgmlLinkExtractor that collects the follow links. And you do not need to create so many Rule objects if they all call the same parse_item function. I would write your code as:
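The likely culprit is the two deny-only rules at the end: an extractor with deny patterns but no allow patterns accepts every link that does not match the deny list, and a Rule with no callback follows the links it extracts. A rough sketch of the filtering logic (a simplified model for illustration, not Scrapy's actual code):

```python
import re

def extractor_keeps(url, allow=(), deny=()):
    """Simplified model of allow/deny filtering in a link extractor:
    deny wins, and an empty allow list accepts everything else."""
    if any(re.search(p, url) for p in deny):
        return False
    if not allow:
        return True
    return any(re.search(p, url) for p in allow)

# A deny-only extractor keeps every other link on the page, so
# issue 377 gets followed even though no allow pattern matches it:
print(extractor_keeps("http://example.com/2011/377/index.htm",
                      deny=("culture.htm",)))  # True
print(extractor_keeps("http://example.com/culture.htm",
                      deny=("culture.htm",)))  # False
```

Merging the deny patterns into the extractor of the follow rule, as below, removes this catch-all behavior.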

    rules = (
        Rule(SgmlLinkExtractor(
                allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', ),
                deny = ('culture\.htm', 'feature\.htm'),
            ),
            follow = True
        ),
        Rule(SgmlLinkExtractor(
                allow = (
                    'fr[0-9].htm',
                    'eg[0-9]*.htm',
                    'ec[0-9]*.htm',
                    'op[0-9]*.htm',
                    'sc[0-9]*.htm',
                    're[0-9]*.htm',
                    'in[0-9]*.htm',
                )
            ),
            callback = 'parse_item',
        ),
    )

If these are the real URL patterns you use for parse_item, this can be simplified further:

    Rule(SgmlLinkExtractor(
            allow = ('(fr|eg|ec|op|sc|re|in)[0-9]*\.htm', ),
        ),
        callback = 'parse_item',
    ),
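A quick check that the single combined pattern matches the kinds of filenames the seven separate allow patterns targeted (the filenames are illustrative):

```python
import re

# One alternation instead of seven separate allow patterns:
combined = re.compile(r'(fr|eg|ec|op|sc|re|in)[0-9]*\.htm')

print(bool(combined.search("fr1.htm")))    # True
print(bool(combined.search("eg12.htm")))   # True
print(bool(combined.search("index.htm")))  # False ('in' is not followed by digits/.htm)
```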

Source: https://habr.com/ru/post/903926/

