Understanding Scrapy CrawlSpider Rules

I am having trouble understanding how to use the rules attribute in my own spider, which inherits from CrawlSpider. My spider is trying to crawl the Yellow Pages listings for pizza in San Francisco.

I tried to keep my rules simple to make sure the spider follows every link in the response, but I don't see that happening. The only thing it does is issue a request for the next listings page, then a request for the page after that, and so on.

I have two questions: 1. When the spider receives a response, does it process the rules before calling the callback, or the other way around? 2. When are the rules actually applied?

EDIT: I figured it out. I had overridden the parse() method from CrawlSpider. After studying the parse() method in that class, I realized that it is the place where the rules are checked and the matching pages get crawled.

NOTE: Know what you are overriding!
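For anyone else who hits this, the relevant part of CrawlSpider looks roughly like the sketch below (paraphrased from memory, not the exact Scrapy source): parse() is where the class hooks in its rule handling, so if you override it, the rules never run.

    # Paraphrased sketch of CrawlSpider's internals (not the exact Scrapy source,
    # details simplified) showing why overriding parse() bypasses the rules.
    class CrawlSpider(Spider):

        def parse(self, response):
            # Every response that would normally reach parse() is routed through
            # _parse_response(), which is the method that applies the rules.
            return self._parse_response(response, self.parse_start_url,
                                        cb_kwargs={}, follow=True)

        def _parse_response(self, response, callback, cb_kwargs, follow=True):
            # First run the callback (parse_start_url for start_urls responses)...
            if callback:
                for output in callback(response, **cb_kwargs) or ():
                    yield output
            # ...then run each Rule's link extractor over the response and yield
            # a new Request (with the Rule's callback) for every extracted link.
            if follow:
                for request in self._requests_to_follow(response):
                    yield request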

Here is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import Selector
from yellowPages.items import YellowpagesItem
from scrapy.http import Request


class YellowPageSpider(CrawlSpider):
    name = "yellowpages"
    allowed_domains = ['www.yellowpages.com']
    businesses = []

    # start with one page
    start_urls = ['http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza']

    rules = (Rule(SgmlLinkExtractor(), callback="parse_items", follow=True), )

    base_url = 'http://www.yellowpages.com'

    def parse(self, response):
        yield Request(response.url, callback=self.parse_business_listings_page)

    def parse_items(self, response):
        print "PARSE ITEMS. Visiting %s" % response.url
        return []

    def parse_business_listings_page(self, response):
        print "Visiting %s" % response.url
        self.businesses.append(self.extract_businesses_from_response(response))
        hxs = Selector(response)
        li_tags = hxs.xpath('//*[@id="main-content"]/div[4]/div[5]/ul/li')
        next_exist = False

        # Check to see if there is a "Next" link. If there is, store the links.
        # If not, return.
        # This requires a linear search through the list of li_tags. Is there a faster way?
        for li in li_tags:
            li_text = li.xpath('.//a/text()').extract()
            li_data_page = li.xpath('.//a/@data-page').extract()

            # Note: sometimes li_text is an empty list, so check that it is non-empty first
            if li_text and li_text[0] == 'Next':
                next_exist = True
                next_page_num = li_data_page[0]
                url = 'http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza&page=' + next_page_num
                yield Request(url, callback=self.parse_business_listings_page)
1 answer


So, to your two questions:

  • The crawler rules are processed before the request is made: the link extractors run over each response, and only the links the rules allow are turned into new requests ... and of course, if a response does not match the allowed domains, it will in theory be received but simply discarded.

  • So the crawler rules are applied before a request is executed, not after the response has already reached your callback (see the sketch after this list).
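
To make that concrete, here is a small sketch (the allow pattern is just an example I made up for this crawl, not something from your code): the link extractor runs over each response, and only links matching the rule ever become requests; everything else is dropped before any request is created.

    from scrapy.contrib.spiders import Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    rules = (
        # Only links matching the allow pattern are turned into requests;
        # non-matching links are discarded at extraction time, before any
        # request is ever scheduled.
        Rule(SgmlLinkExtractor(allow=(r'/san-francisco-ca/pizza', )),
             callback='parse_items',
             follow=True),
    )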

ATTENTION!

In your example, the issue is that you are overriding the parse() method ... are you even sure you are using it correctly here?! You will need to run it to confirm, but for those reading along: if you explicitly override the parse() method of a CRAWL spider, you break its rule handling. When using a crawl spider, parse_item() (or whatever name you give your rule callback) is the method you are meant to write ... parse() in the crawl spider is its own internal logic function ... USING IT AS A CALLBACK IN THE RULE SET, or overriding it, should not be done.

https://doc.scrapy.org/en/latest/topics/spiders.html
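
Concretely, for your spider that means removing the parse() override and letting the rule drive the crawl, roughly like this (a sketch based on your own code, untested against the live site):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class YellowPageSpider(CrawlSpider):
        name = "yellowpages"
        allowed_domains = ['www.yellowpages.com']
        start_urls = ['http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza']

        # No parse() override here: CrawlSpider.parse() stays intact, applies the
        # rules to every response, and calls parse_items for each followed link.
        rules = (
            Rule(SgmlLinkExtractor(), callback="parse_items", follow=True),
        )

        def parse_items(self, response):
            print "PARSE ITEMS. Visiting %s" % response.url
            return []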
