I have a CrawlSpider set up to crawl a news archive on a site where the links to each issue follow this URL pattern:
http://example.com/YYYY/DDDD/index.htm , where YYYY is the year and DDDD is the issue number, three or four digits long.
I only need issues 928 and up, per my rules below. I have no problem connecting to the site, crawling, or extracting items (so I have left out the rest of my code). But the spider seems determined to follow links it shouldn't: it tries to crawl issues 377, 398, and more, and it follows the "culture.htm" and "feature.htm" links. This causes a lot of errors and, while not critical, means a lot of data cleaning afterwards. Any suggestions as to what is going wrong?
class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/issues.htm"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm',)), follow=True),
        Rule(SgmlLinkExtractor(allow=(r'fr[0-9].htm',)), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=(r'eg[0-9]*.htm',)), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=(r'ec[0-9]*.htm',)), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=(r'op[0-9]*.htm',)), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=(r'sc[0-9]*.htm',)), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=(r're[0-9]*.htm',)), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=(r'in[0-9]*.htm',)), callback='parse_item'),
        Rule(SgmlLinkExtractor(deny=(r'culture.htm',))),
        Rule(SgmlLinkExtractor(deny=(r'feature.htm',))),
    )
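For what it's worth, the allow pattern on the first rule does seem to behave as intended when I test it in isolation with Python's re module (the URLs below are made-up examples following the pattern described above):

```python
import re

# The allow pattern from the first rule, checked outside of Scrapy.
pattern = re.compile(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm')

# Issues 928 and up (including four-digit issue numbers) match:
assert pattern.search('http://example.com/2010/928/index.htm')
assert pattern.search('http://example.com/2011/1024/index.htm')

# Older three-digit issues do not:
assert pattern.search('http://example.com/2002/377/index.htm') is None
assert pattern.search('http://example.com/2003/398/index.htm') is None
```

So the regex itself rejects the unwanted issue numbers, which makes the spider's behavior all the more puzzling.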
EDIT: I fixed it by switching to a much simpler regular expression matching just 2009, 2010, and 2011, but I'm still wondering why the version above doesn't work, if anyone has any suggestions.
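For reference, here is roughly the simpler pattern I switched to (a reconstruction; the exact expression may differ, but the idea is to whitelist only the three year directories I care about):

```python
import re

# Hypothetical reconstruction of the simpler fix: only follow issue index
# pages that live under the 2009, 2010, or 2011 year directories.
simple = re.compile(r'(2009|2010|2011)/\d+/index\.htm')

# A wanted issue page under an allowed year matches:
assert simple.search('http://example.com/2010/931/index.htm')

# An old issue under an earlier year does not:
assert simple.search('http://example.com/2002/377/index.htm') is None
```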