Scrapy gives URLError: <urlopen error timed out>
So, I have a Scrapy program that I'm trying to get off the ground, but I can't get my code to run; the error below always appears.
I can still visit the site with the scrapy shell command on the URL I'm using, and everything works.
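For example, something like this works fine in the shell (the URL here is just the placeholder one from my code below):

scrapy shell "http://www.website.net/stuff.php?"
>>> response.status
>>> response.xpath('//*[@id="content"]')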
Here is my code
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from Malscraper.items import MalItem


class MalSpider(CrawlSpider):
    name = 'Mal'
    allowed_domains = ['www.website.net']
    start_urls = ['http://www.website.net/stuff.php?']
    rules = [
        Rule(LinkExtractor(
            allow=['//*[@id="content"]/div[2]/div[2]/div/span/a[1]']),
            callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        mal_list = response.xpath('//*[@id="content"]/div[2]/table/tr/td[2]/')

        for mal in mal_list:
            item = MalItem()
            item['name'] = mal.xpath('a[1]/strong/text()').extract_first()
            item['link'] = mal.xpath('a[1]/@href').extract_first()
            yield item

Edit: Here is the trace.
Traceback (most recent call last):
  File "C:\Users\2015\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>

Edit2:
So, with the scrapy shell command I am able to manipulate my responses, but I just noticed that the same exact error also appears when visiting the site.
Edit3:
Now I have found that the error appears for EVERY website I use the shell command on, but I am still able to manipulate the response.
Edit4: So, how can I verify that I'm at least getting a response from Scrapy when I run the crawl command? Right now I can't tell whether it's my code or this error that is the reason my logs come out empty.
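The only thing I could think of trying is dropping a log line into parse_item just to see whether it ever gets called, something like this (not sure if that is the right way to check), and then running scrapy crawl Mal to see if that line ever shows up in the log:

    def parse_item(self, response):
        # temporary check: log that a response actually reached the callback
        self.log("parse_item got %s (status %s)" % (response.url, response.status))
        # ... rest of parse_item as above ...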
Here is my settings.py:
BOT_NAME = 'Malscraper'

SPIDER_MODULES = ['Malscraper.spiders']
NEWSPIDER_MODULE = 'Malscraper.spiders'
FEED_URI = 'logs/%(name)s/%(time)s.csv'
FEED_FORMAT = 'csv'

There is an open Scrapy issue for this problem: https://github.com/scrapy/scrapy/issues/1054
Although it seems to be just a warning on other platforms.
You can disable the S3DownloadHandler (which causes this error) by adding this to your Scrapy settings:
DOWNLOAD_HANDLERS = {
    's3': None,
}

This is very annoying. What happens is that you have Null credentials, and boto decides to populate them for you from the metadata server (if one exists) using _populate_keys_from_metadata_server() . See here and here . If you don't run an EC2 instance, or something else that runs the metadata server (listening on the auto-magic IP 169.254.169.254), the attempt times out. This was fine and quiet since scrapy handles the exception here , but unfortunately boto started logging it here , hence the annoying message. Besides disabling the s3 handler as mentioned earlier ..., which looks a bit scary, you can achieve a similar result by simply setting the credentials to an empty string.
AWS_ACCESS_KEY_ID = "" AWS_SECRET_ACCESS_KEY = ""