HTTP 403 Responses When Using Python Scrapy

I am using Python.org version 2.7 64-bit on Windows Vista 64-bit. I tested the following Scrapy code to recursively scrape all the pages on www.whoscored.com, which is a football statistics site:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from scrapy.item import Item
    from scrapy.spider import BaseSpider
    from scrapy import log
    from scrapy.cmdline import execute
    from scrapy.utils.markup import remove_tags


    class ExampleSpider(CrawlSpider):
        name = "goal3"
        allowed_domains = ["whoscored.com"]
        start_urls = ["http://www.whoscored.com/"]
        rules = [
            Rule(SgmlLinkExtractor(allow=()), follow=True),
            Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
        ]

        def parse_item(self, response):
            self.log('A response from %s just arrived!' % response.url)
            scripts = response.selector.xpath("normalize-space(//title)")
            for script in scripts:
                body = response.xpath('//p').extract()
                body2 = "".join(body)
                print remove_tags(body2).encode('utf-8')

    execute(['scrapy', 'crawl', 'goal3'])

The code executes without any errors; however, of the 4,623 pages crawled, 217 returned an HTTP 200 response code, 2 returned a 302, and 4,404 returned a 403. Can anyone see anything immediately obvious in the code that could cause this? Could this be an anti-scraping measure on the site's part? Is it common practice to slow down the rate of requests to avoid this?

Thanks

2 answers

An HTTP 403 status code means Forbidden / access denied.
An HTTP 302 status code means the request was redirected. There is no need to worry about those. There is nothing wrong with your code itself.

Yes, this is almost certainly an anti-scraping (rate-limiting) measure implemented by the site.

Refer to these recommendations from the Scrapy documentation: Avoiding getting banned.

In addition, you should consider pausing and resuming the crawl.
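To slow the crawl down and make it resumable, a minimal sketch of the relevant options in settings.py (the delay values and the JOBDIR path are illustrative assumptions, not recommendations from the Scrapy docs):

```python
# settings.py -- sketch only; tune the values for your target site.

# Fixed pause (in seconds) between requests to the same site,
# so the crawl looks less aggressive.
DOWNLOAD_DELAY = 2

# Let Scrapy's AutoThrottle extension adjust the delay automatically
# based on how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10

# Persist crawl state to disk so the crawl can be paused (Ctrl-C)
# and resumed later with the same command. The path is arbitrary.
JOBDIR = 'crawls/goal3-run1'
```

With JOBDIR set, running the same `scrapy crawl goal3` command again picks up where the interrupted crawl left off.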


Not sure if this is still relevant, but I had to put the following lines in my settings.py file:

    HTTPERROR_ALLOWED_CODES = [404]
    USER_AGENT = 'quotesbot (+http://www.yourdomain.com)'
    # Note: this second assignment overrides the USER_AGENT above.
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"

Hope this helps.


Source: https://habr.com/ru/post/972421/
