I am using Python.org version 2.7 64-bit on Windows Vista 64-bit. I tested the following Scrapy code to recursively scrape all the pages of www.whoscored.com, a football statistics site:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/"]

    # Follow every link on the site; pages matched by the second rule are
    # handed to parse_item.
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
    ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for script in scripts:
            # Extract all paragraph text from the page and print it
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')


execute(['scrapy', 'crawl', 'goal3'])
The code executes without any errors; however, of the 4,623 pages crawled, 217 returned an HTTP 200 response, 2 returned a 302, and 4,404 returned a 403. Can anyone see anything immediately obvious in the code that could cause this? Could this be an anti-scraping measure on the site? Is it common practice to slow down the rate of requests to avoid this?
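For context on that last question: if request rate or identification is the issue, my understanding from the Scrapy docs is that it can be throttled via a few settings. Below is a rough sketch of what I would try in settings.py; the setting names are Scrapy's, but the specific values and the user-agent string are just guesses, not something I have verified against whoscored.com:

# settings.py -- hypothetical throttling/identification tweaks (values are guesses)
DOWNLOAD_DELAY = 2                    # wait ~2 seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # keep only one request in flight per domain
AUTOTHROTTLE_ENABLED = True           # let Scrapy adapt the delay to server response times
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.0; Win64; x64)'  # identify as a browser instead of the default Scrapy UA
# RANDOMIZE_DOWNLOAD_DELAY is True by default, which adds jitter on top of DOWNLOAD_DELAY

I don't know whether this would actually change the 403s here, which is part of what I'm asking.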
Thanks.