Scrapy spider doesn't work

Since nothing works, I started a new project with

python scrapy-ctl.py startproject Nu 

I followed the tutorial closely, created the folders, and wrote a new spider:

 from scrapy.contrib.spiders import CrawlSpider, Rule
 from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
 from scrapy.selector import HtmlXPathSelector
 from scrapy.item import Item
 from Nu.items import NuItem
 from urls import u

 class NuSpider(CrawlSpider):
     domain_name = "wcase"
     start_urls = ['http://www.whitecase.com/aabbas/']

     names = hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')

     u = names.pop()

     rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)

     def parse(self, response):
         self.log('Hi, this is an item page! %s' % response.url)

         hxs = HtmlXPathSelector(response)
         item = Item()

         item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
         return item

 SPIDER = NuSpider()

and when I run

 C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase 

I get

 [Nu] ERROR: Could not find spider for domain: wcase 

Other spiders are at least recognized by Scrapy, but this one is not. What am I doing wrong?

Thanks for your help!

-1
5 answers

Please also check your Scrapy version. In the latest version, the attribute name is used instead of domain_name to uniquely identify a spider.
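For example, on a newer Scrapy release the spider would be declared roughly like this (a minimal sketch, keeping the old contrib import path from the question; only the identifying attribute changes):

 from scrapy.contrib.spiders import CrawlSpider

 class NuSpider(CrawlSpider):
     name = "wcase"            # newer Scrapy: "name" identifies the spider
     # domain_name = "wcase"   # older Scrapy versions used this attribute instead
     start_urls = ['http://www.whitecase.com/aabbas/']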

+6

These two lines look as if they are causing problems:

 u = names.pop()
 rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)
  • Each time the script is run, only one rule will be executed. Consider creating a rule for each URL.
  • You did not create a parse_item callback, which means the rule does nothing. The only callback you defined is parse, which changes the default behaviour of the spider.

In addition, here are some things to watch out for.

  • CrawlSpider does not like having its default parse method overloaded. Search for parse_start_url in the documentation or the docstrings. You will see that it is the preferred way to override the default parse method for your start URLs.
  • NuSpider.hxs is referenced before it is ever defined. A rough sketch combining these points follows after this list.
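Putting those points together, the spider could look something like the sketch below. This keeps the old contrib imports and the XPath expressions from the question, assumes NuItem declares a school field, and uses a placeholder allow pattern for the profile links; it is not a drop-in fix, just an illustration of where each piece belongs:

 from scrapy.contrib.spiders import CrawlSpider, Rule
 from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
 from scrapy.selector import HtmlXPathSelector
 from Nu.items import NuItem

 class NuSpider(CrawlSpider):
     domain_name = "wcase"  # "name" on newer Scrapy versions
     start_urls = ['http://www.whitecase.com/aabbas/']

     # One rule extracts the profile links and hands them to parse_item.
     # The allow pattern is a placeholder; adjust it to the real URLs.
     rules = (
         Rule(SgmlLinkExtractor(allow=(r'/aabbas/.+',)), callback='parse_item'),
     )

     # If the start URLs themselves must be scraped, override parse_start_url
     # here instead of parse.

     def parse_item(self, response):
         # This is the callback named in the rule, not the spider's default parse.
         hxs = HtmlXPathSelector(response)
         item = NuItem()
         item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re(r'(?<=(JD,\s))(.*?)(\d+)')
         return item

 SPIDER = NuSpider()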
+3

Have you included the spider in the SPIDER_MODULES list in your scrapy_settings.py?

This isn't mentioned anywhere in the tutorial, but it is required.
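For a project created with startproject Nu, the setting would look roughly like this (a sketch, assuming the spiders live in the Nu/spiders package and the settings file is named scrapy_settings.py as above):

 # in the project's settings file
 SPIDER_MODULES = ['Nu.spiders']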

+2

I believe there are errors in that code: names = hxs.select(...) will not work at class level, because the hxs object has not been defined at that point.

Try running python yourproject/spiders/domain.py to get syntax errors.
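In other words, the selector can only be built once a response exists, so that line belongs inside a callback. A minimal sketch of the difference, reusing the XPath from the question (parse_start_url is the method CrawlSpider calls for the start URLs' responses):

 from scrapy.contrib.spiders import CrawlSpider
 from scrapy.selector import HtmlXPathSelector

 class NuSpider(CrawlSpider):
     domain_name = "wcase"
     start_urls = ['http://www.whitecase.com/aabbas/']

     # At class level there is no response yet, so this raises a NameError:
     # names = hxs.select('//td[@class="altRow"][1]/a/@href').re(r'/.a\w+')

     def parse_start_url(self, response):
         # Inside a callback the selector is created from the response first.
         hxs = HtmlXPathSelector(response)
         names = hxs.select('//td[@class="altRow"][1]/a/@href').re(r'/.a\w+')
         self.log('found %d profile links' % len(names))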

+2

You are overriding the parse method instead of implementing a new parse_item method.

+2
