I am having trouble understanding how to use the rules attribute inside my own spider, which inherits from CrawlSpider. My spider is trying to crawl through the Yellow Pages listings for pizza in San Francisco.
I have tried to keep my rules simple to make sure the spider follows every link in the response, but I don't see that happening. My only result is that it issues a request for the next page, and then another request for the page after that.
I have two questions: 1. When it receives a response, does the spider process the rules before calling the callback, or the other way around? 2. When are the rules applied?
EDIT: I figured it out. I had overridden the parse method from CrawlSpider. After studying the parse method in that class, I realized that it is where the rules are checked and the matching sites are crawled.
NOTE: Be aware of what you are overriding; if you define your own parse, you replace the CrawlSpider method that applies the rules.
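To make that note concrete, here is a minimal sketch of the gotcha; the class names, callback name, and allow pattern are illustrative, not from the original post:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class BrokenSpider(CrawlSpider):
    # Defining parse here shadows CrawlSpider.parse, which is the method that
    # matches each response against the rules, so no rule is ever applied and
    # no rule callback runs.
    name = "broken"

    def parse(self, response):
        pass

class FixedSpider(CrawlSpider):
    # Leave parse alone and give the rule a differently named callback;
    # CrawlSpider.parse then applies the rules and routes matching responses
    # to parse_item.
    name = "fixed"
    rules = (
        Rule(SgmlLinkExtractor(allow=('/pizza',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log("visited %s" % response.url)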
Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import Selector
from yellowPages.items import YellowpagesItem
from scrapy.http import Request

class YellowPageSpider(CrawlSpider):
    name = "yellowpages"
    allowed_domains = ['www.yellowpages.com']
    businesses = []
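The snippet above is cut off after businesses = []. Here is a sketch of how the rest of the class body might be filled in; the start URL, the allow pattern, and the parse_listing callback name are my assumptions for illustration, not the original code:

    # Hypothetical continuation of the YellowPageSpider class body above.
    start_urls = ['http://www.yellowpages.com/san-francisco-ca/pizza']

    rules = (
        # Follow result and pagination links and hand each matching response
        # to parse_listing; the callback is deliberately not named parse,
        # since CrawlSpider.parse is what applies these rules.
        Rule(SgmlLinkExtractor(allow=(r'san-francisco-ca/pizza',)),
             callback='parse_listing', follow=True),
    )

    def parse_listing(self, response):
        sel = Selector(response)
        item = YellowpagesItem()
        # ... populate item fields from sel.xpath(...) here ...
        yield item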