In what order are the rules evaluated in CrawlSpider?

I have a question regarding the order in which rules are evaluated in CrawlSpider. If I have the code below:

from scrapy.contrib.spiders.crawl import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    start_urls = ['http://someurlhere.com']
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']",
            ]),
            callback='first_callback',
        ),
        Rule(
            # `allow` takes a regex, so `.` and `?` must be escaped
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='second_callback',
        ),
    )

In this case:

  • The engine takes 'http://someurlhere.com' from the start_urls list, downloads it, and calls the default parse callback when the response arrives.
  • In that parse method, CrawlSpider then extracts links from the response based on the XPaths we provided to the FIRST LinkExtractor.

Now my question is: the links that are extracted by the FIRST LinkExtractor, are they just scheduled in the scheduler and not followed immediately? So only after it has scheduled all the links extracted by the first LinkExtractor will it call the first_callback method for each of those links, with the corresponding response passed to that first_callback?

Also, when will the second LinkExtractor be invoked? Is the first LinkExtractor evaluated first, and only then the second LinkExtractor?

1 answer

If we look at the official documentation, the process is simple.

First your start URL is parsed, and then every subsequent link is extracted and crawled according to the rules you provided.
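In outline, it works like the sketch below, a simplified paraphrase of CrawlSpider's internals (the method names parse, parse_start_url, _parse_response and _requests_to_follow do exist in scrapy's source, but the bodies here are trimmed, not verbatim):

def parse(self, response):
    # CrawlSpider reserves `parse` for itself; user code belongs in rule callbacks.
    return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

def _parse_response(self, response, callback, cb_kwargs, follow=True):
    # 1) Run the callback for this response (if any) and yield its results...
    if callback:
        for item_or_request in callback(response, **cb_kwargs) or ():
            yield item_or_request
    # 2) ...then extract further links to crawl, according to the rules.
    if follow:
        for request in self._requests_to_follow(response):
            yield request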

Now to your question.

Now my question is: the links that are extracted by the FIRST LinkExtractor Rule, are they just scheduled in the scheduler and not followed immediately? So only after it has scheduled all the links extracted by the first LinkExtractor will it call the first_callback method for each of those links, with the corresponding response passed to that first_callback?

follow defaults to True if callback is None; otherwise it defaults to False. Since both of your rules define a callback, follow is False for both, which means there will be no following at all. Whatever links are extracted from the response to the start URL is everything you will have in the scheduler, and your crawl will end once all of those have been parsed.
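That behaviour comes from Rule itself. A minimal sketch of the relevant logic, paraphrased from scrapy's Rule.__init__ (not a verbatim copy):

class Rule:
    # Sketch: only the pieces relevant to the `follow` default are shown.
    def __init__(self, link_extractor, callback=None, follow=None):
        self.link_extractor = link_extractor
        self.callback = callback
        # No explicit follow given:
        #   callback is None -> follow=True  (keep crawling from these pages)
        #   callback is set  -> follow=False (parse the page, stop there)
        self.follow = follow if follow is not None else not callback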

If you want to follow links, split your rules accordingly: figure out which pages hold your actual content and which pages merely lead to it. The example from the documentation shows the pattern:

# Extract links matching 'products' (but not matching 'shampoo')
# and follow links from them (since no callback means follow=True by default).
Rule(LinkExtractor(allow=('products', ), deny=('shampoo', ))),

# Extract links matching 'item' and parse them with the spider method parse_item.
Rule(LinkExtractor(allow=('item', )), callback='parse_item'),
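Applied to the spider from the question, that might look like the sketch below. The URL, XPaths and callback names are the asker's; adding follow=True to the first rule is the only substantive change, and the spider name and callback bodies are placeholder stubs:

from scrapy.contrib.spiders.crawl import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
# (on modern Scrapy these live at scrapy.spiders and scrapy.linkextractors)

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://someurlhere.com']
    rules = (
        # Menu pages: parse them AND keep following links found on them
        # (follow=True overrides the follow=False default that supplying
        # a callback implies).
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']",
            ]),
            callback='first_callback',
            follow=True,
        ),
        # Product pages: parse them; nothing further to follow there.
        Rule(
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='second_callback',
        ),
    )

    def first_callback(self, response):
        self.log('category page: %s' % response.url)

    def second_callback(self, response):
        self.log('product page: %s' % response.url)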

Now, on to your second question:

Also, when will the second LinkExtractor be invoked? Is the first LinkExtractor evaluated first, and only then the second LinkExtractor?

One rule is independent of the other. Each LinkExtractor object applies its regular-expression or string matching to the response on its own; when it finds a matching URL, the rule's callback is executed and/or the link is followed. The rules are simply tried in the order you declared them, and a link already matched by an earlier rule is not handed to a later one.
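You can see this in CrawlSpider's own source. Below is a simplified paraphrase (not the verbatim code) of CrawlSpider._requests_to_follow, the method that applies the rules: they run in declaration order, and a seen set keeps one link from being claimed twice:

from scrapy.http import Request

# Simplified paraphrase of CrawlSpider._requests_to_follow; the
# process_links/process_request hooks and response-type checks are omitted.
def _requests_to_follow(self, response):
    seen = set()
    for rule in self._rules:  # rules in declaration order
        links = [link for link in rule.link_extractor.extract_links(response)
                 if link not in seen]
        for link in links:
            seen.add(link)  # later rules skip links an earlier rule matched
            # _response_downloaded dispatches to the rule's callback and,
            # if rule.follow is True, back into this method.
            yield Request(url=link.url, callback=self._response_downloaded)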

