I wrote a Scrapy CrawlSpider to crawl a site with a structure like: category page > type page > list page > machine page. The category page lists many categories of machines, each category has a type page with many types, each type has a list page of machines, and finally each machine has its own page with information about it.
My spider has a rule to go from the home page to the category page, where the callback is parsecatpage. This creates an item, captures the category, and yields a new request for each category on the page. I pass the item and the category name via request.meta and set the callback to parsetypepage.
parsetypepage gets the item from response.meta, then yields a request for each type, passing along the item as well as the concatenation of the category and type in request.meta. The callback is parsemachinelist.
parsemachinelist receives the item from response.meta, then yields a request for each machine in the list, passing the item and the category/type description via request.meta to the final callback, parsemachine. This reads the meta attributes and fills in all the fields of the item, using the information on the machine page together with the information carried over from the previous pages, and finally yields the item.
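To make the hand-off concrete, here is a minimal, Scrapy-free sketch of the meta-passing chain described above. The Request stub, the crawl() driver, and all of the page data are invented stand-ins (not my real spider or site), and it's Python 3 rather than the Python 2 the spider uses; it only illustrates how each callback reads from meta and schedules the next level:

```python
# Schematic of the callback chain: each callback reads the item from
# meta, adds what it learned, and schedules requests for the next level.
# Request is a bare stub standing in for scrapy.http.Request.

class Request:
    def __init__(self, url, callback, meta):
        self.url, self.callback, self.meta = url, callback, meta

# Fake site data standing in for the real pages (hypothetical).
SITE = {
    "/cats": ["Grinders", "Lathes"],
    "/cats/Grinders": ["Disc - Horizontal, Single End"],
    "/cats/Lathes": ["Engine"],
}

def parsecatpage(url, meta):
    for cat in SITE[url]:
        item = {"category": cat}                 # new item, category captured
        yield Request(url + "/" + cat, parsetypepage,
                      {"item": item, "category": cat})

def parsetypepage(url, meta):
    for typ in SITE[url]:
        desc = meta["category"] + " / " + typ    # concatenate category and type
        yield Request(url + "#list", parsemachinelist,
                      {"item": meta["item"], "description": desc})

def parsemachinelist(url, meta):
    for n in range(2):                           # pretend two machines per list
        yield Request(url + "#m%d" % n, parsemachine, dict(meta))

def parsemachine(url, meta):
    item = dict(meta["item"])                    # fill the item's fields
    item["description"] = meta["description"]
    item["url"] = url
    yield item                                   # finished item

def crawl():
    """Breadth-first drive of the stub requests; returns finished items."""
    pending = [Request("/cats", parsecatpage, {})]
    items = []
    while pending:
        req = pending.pop(0)
        for out in req.callback(req.url, req.meta):
            (pending if isinstance(out, Request) else items).append(out)
    return items
```

With this stub scheduler every branch is followed, which is the behaviour I expect from the real spider.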
If I restrict this to one category and one type (for example contains(@href, "filter=c:Grinders") and contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End")), then it works and there is a machine item for each machine on the final page. The problem is that as soon as I allow the spider to scrape all categories and all types, it returns only the items for the machines on the first of the final pages it reaches; once it has done that the spider finishes and does not get the other categories etc.
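For readers, the URL-encoded type filter in that restriction decodes as follows (the filter string is taken verbatim from above; this uses Python 3's urllib.parse, whereas the spider itself is on the older Python 2 urlparse):

```python
from urllib.parse import unquote_plus

# The type filter value from the restriction, URL-decoded:
type_filter = "Disc+-+Horizontal%2C+Single+End"
print(unquote_plus(type_filter))  # Disc - Horizontal, Single End
```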
Here is the (anonymised) code:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from myspider.items import MachineItem
import urlparse

class MachineSpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/index.php']
    rules = (
        Rule(SgmlLinkExtractor(allow_domains=('example.com'), allow=('12\.html'), unique=True),
             callback='parsecatpage'),
    )

    def parsecatpage(self, response):
        hxs = HtmlXPathSelector(response)
So, for example, the spider will find Grinders on the category page and go to the Grinders type page, where it will find the Disc - Horizontal, Single End type; then it will go to that page, find the list of machines, go to each machine page, and finally there will be an item for each machine. If I try to crawl Grinders and Lathes, though, it will work through the Grinders correctly, then crawl the Lathes type pages and stop there, without generating requests for the lathe list pages and the final lathe pages.
Can anyone help with this? Why does the spider not get to the second (or third, etc.) machine list page when there is more than one category of machines?
Sorry for the epic post, I'm just trying to explain the problem!!
Thanks!!