The rules attribute for a CrawlSpider specifies how to extract links from a page and which callbacks should be called for those links. The links are handled by the default parse() method implemented in that class (see the CrawlSpider source code).
So, whenever you want to trigger the rules for a URL, you just need to yield scrapy.Request(url, self.parse), and the Scrapy engine will send a request to that URL and apply the rules to the response.
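For instance, here is a minimal sketch; the spider name, URLs and the parse_category() callback are made up for illustration and are not part of the original example:

import scrapy
from scrapy.contrib.spiders.crawl import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class TriggerRulesSpider(CrawlSpider):
    name = 'trigger_rules'
    start_urls = ['http://someurlhere.com']
    rules = (
        Rule(LinkExtractor(allow=r'/category\.php'), callback='parse_category'),
    )

    def parse_category(self, response):
        # Hypothetical case: this page links to an extra listing page that
        # the rules would not otherwise reach; handing the URL back to
        # self.parse makes Scrapy apply the rules to that response too.
        extra_url = response.urljoin('/special-offers.php')
        yield scrapy.Request(extra_url, self.parse)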
Extracting the links (which may or may not use restrict_xpaths) is done by the LinkExtractor object registered for that rule. It basically searches for all <a> and <area> elements in the whole page, or only in the elements obtained after applying the restrict_xpaths expressions, if that attribute is set.
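To see which links a LinkExtractor picks up, you can also run one by hand against a response. A rough sketch with made-up HTML (the markup and class names are assumptions):

from scrapy.http import HtmlResponse
from scrapy.contrib.linkextractors import LinkExtractor

html = b"""
<ul class='menu-categories'>
  <li><a href='/category.php?id=1'>Books</a></li>
</ul>
<p><a href='/about.php'>About us</a></p>
"""
response = HtmlResponse('http://someurlhere.com/', body=html, encoding='utf-8')

# Only <a> and <area> elements inside the restricted region are considered.
extractor = LinkExtractor(restrict_xpaths=["//ul[@class='menu-categories']"])
for link in extractor.extract_links(response):
    print(link.url)  # -> http://someurlhere.com/category.php?id=1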
Example:
Let's say you have a CrawlSpider like this:
from scrapy.contrib.spiders.crawl import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    start_urls = ['http://someurlhere.com']
    rules = (
        # Category and subcategory links are handled by parse(), so the
        # crawl rules are applied to those pages as well.
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']"]),
            callback='parse'
        ),
        # Product page links (note the escaped '.' and '?' in the regex)
        # are handled by parse_product_page().
        Rule(
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='parse_product_page'
        ),
    )

    def parse_product_page(self, response):
        # scrape the product data here
        pass
The engine starts sending requests to the URLs in start_urls, using the default callback (the parse() method of CrawlSpider) for the responses.
For each response, the parse() method runs the link extractors on it to get the links from the page. Namely, it calls LinkExtractor.extract_links(response) for each response object to get the URLs, and then yields scrapy.Request(url, <rule_callback>) objects.
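Conceptually, that step amounts to something like the simplified sketch below. This is only an illustration of the idea, not the real CrawlSpider code, which also handles the follow flag, process_links, and request deduplication:

import scrapy

def follow_rules(spider, response):
    # Run every rule's link extractor on the response and yield one
    # request per extracted link, wired to that rule's callback.
    for rule in spider.rules:
        # In the example above the callbacks are given by name
        # ('parse', 'parse_product_page'), so look them up on the spider.
        callback = getattr(spider, rule.callback)
        for link in rule.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=callback)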
The example code is a spider skeleton that crawls an e-commerce site, following the links of product categories and subcategories, in order to get links for each of the product pages.
For the rules registered specifically in that spider, it will crawl the links inside the "categories" and "subcategories" lists with the parse() method as the callback (which will cause the crawl rules to be triggered for those pages), and the links matching the regular expression /product\.php\?id=\d+ with parse_product_page(), which will finally scrape the product data (a possible implementation is sketched below).
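A parse_product_page() body for the spider above could look like the following; the XPath expressions and field names are assumptions about the target page, not something given in the original example:

    def parse_product_page(self, response):
        # Hypothetical selectors and field names; adapt them to the real markup.
        yield {
            'name': response.xpath("//h1[@class='product-name']/text()").extract_first(),
            'price': response.xpath("//span[@class='price']/text()").extract_first(),
            'url': response.url,
        }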
As you can see, quite powerful stuff. =)
More details: see the CrawlSpider and LinkExtractor sections of the Scrapy documentation.