Scrapy: CrawlSpider does not generate all links in nested callbacks

I wrote a Scrapy CrawlSpider to crawl a site with a structure of category page > type page > list page > machine page. On the category page there are many categories of machines, each of which has a type page with many types; each type has a list of machines, and finally each machine has a page with information about it.

My spider has a rule that takes it from the home page to the category page, where the callback is parsecatpage. This creates an item, grabs the category, and yields a new request for each category on the page. I pass the item and the category name along in request.meta and set the callback to parsetypepage.

parsetypepage gets the item from response.meta, then yields requests for each type, passing the item and the concatenation of the category and type along in request.meta. The callback is parsemachinelist.

parsemachinelist gets the item from response.meta, then yields requests for each machine in the list, passing the item, the category/type string, and a description via request.meta to the final callback, parsemachine. That callback reads the meta attributes, fills in all the fields of the item using the information on the page plus what was carried over from the previous pages, and finally yields the item.
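In isolation, the hand-off between two adjacent callbacks looks like this (a minimal sketch in the same old-Scrapy style as the full code below; category_url is just a placeholder, not a real variable from my spider):

    def parsecatpage(self, response):
        item = MachineItem()                    # start a part-built item
        req = Request(category_url, callback=self.parsetypepage)
        req.meta['item'] = item                 # attach it to the request
        yield req

    def parsetypepage(self, response):
        item = response.meta['item']            # the same item, one page later
        # ... add more data, then pass it on again via req.meta ...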

If I restrict this to one category and one type (for example contains(@href, "filter=c:Grinders") and contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End")), then it works, and there is a machine item for each machine on the final pages. The problem is that as soon as I allow the spider to scrape all categories and all types, it only returns items for the machines on the first of the final pages it reaches, and after it has done that the spider finishes without getting to the other categories, etc.

Here is the (anonymized) code:

    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.http import Request
    from myspider.items import MachineItem
    import urlparse

    class MachineSpider(CrawlSpider):
        name = 'myspider'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/index.php']

        rules = (
            Rule(SgmlLinkExtractor(allow_domains=('example.com'), allow=('12\.html'), unique=True),
                 callback='parsecatpage'),
        )

        def parsecatpage(self, response):
            hxs = HtmlXPathSelector(response)
            #this works, next line doesn't
            categories = hxs.select('//a[contains(@href, "filter=c:Grinders")]')
            categories = hxs.select('//a[contains(@href, "filter=c:Grinders") or contains(@href, "filter=c:Lathes")]')
            for cat in categories:
                item = MachineItem()
                req = Request(urlparse.urljoin(response.url, ''.join(cat.select("@href").extract()).strip()),
                              callback=self.parsetypepage)
                req.meta['item'] = item
                req.meta['machinecategory'] = ''.join(cat.select("./text()").extract())
                yield req

        def parsetypepage(self, response):
            hxs = HtmlXPathSelector(response)
            #this works, next line doesn't
            types = hxs.select('//a[contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End")]')
            types = hxs.select('//a[contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End") or contains(@href, "filter=t:Lathe%2C+Production")]')
            for typ in types:
                item = response.meta['item']
                req = Request(urlparse.urljoin(response.url, ''.join(typ.select("@href").extract()).strip()),
                              callback=self.parsemachinelist)
                req.meta['item'] = item
                req.meta['machinecategory'] = ': '.join([response.meta['machinecategory'], ''.join(typ.select("./text()").extract())])
                yield req

        def parsemachinelist(self, response):
            hxs = HtmlXPathSelector(response)
            for row in hxs.select('//tr[contains(td/a/@href, "action=searchdet")]'):
                item = response.meta['item']
                detail_url = urlparse.urljoin(response.url, ''.join(row.select('./td/a[contains(@href,"action=searchdet")]/@href').extract()).strip())
                req = Request(detail_url, callback=self.parsemachine)
                print detail_url
                req.meta['item'] = item
                req.meta['descr'] = row.select('./td/div/text()').extract()
                req.meta['machinecategory'] = response.meta['machinecategory']
                yield req

        def parsemachine(self, response):
            hxs = HtmlXPathSelector(response)
            item = response.meta['item']
            item['machinecategory'] = response.meta['machinecategory']
            item['comp_name'] = 'Name'
            item['description'] = response.meta['descr']
            item['makemodel'] = ' '.join([''.join(hxs.select('//table/tr[contains(td/strong/text(), "Make")]/td/text()').extract()),
                                          ''.join(hxs.select('//table/tr[contains(td/strong/text(), "Model")]/td/text()').extract())])
            item['capacity'] = hxs.select('//tr[contains(td/strong/text(), "Capacity")]/td/text()').extract()
            relative_image_url = hxs.select('//img[contains(@src, "custom/modules/images")]/@src')[0].extract()
            abs_image_url = urlparse.urljoin(response.url, relative_image_url.strip())
            item['image_urls'] = [abs_image_url]
            yield item

    SPIDER = MachineSpider()

So, for example, the spider will find Grinders on the category page and go to the Grinders type page, where it will find the Disc - Horizontal, Single End type; it then goes to that page, finds the list of machines, visits each machine's page, and finally there is an item for each machine. If I try to do Grinders and Lathes, though, it works through the Grinders, then crawls the Lathes category and type pages and stops there, without generating the requests for the Lathes list and the final Lathes pages.

Can anyone help with this? Why doesn't the spider get to the second (or third, etc.) machine-list page when there is more than one category of machines?

Sorry for the epic post, I'm just trying to explain the problem!

Thanks!!

2 answers

You should print the request URL to check it. You can also try this version:

    def parsecatpage(self, response):
        hxs = HtmlXPathSelector(response)
        categories = hxs.select('//a[contains(@href, "filter=c:Grinders") or contains(@href, "filter=c:Lathes")]')
        for cat in categories:
            item = MachineItem()
            cat_url = urlparse.urljoin(response.url, cat.select("./@href").extract()[0])
            print 'url:', cat_url  # to see what's there
            cat_name = cat.select("./text()").extract()[0]
            req = Request(cat_url, callback=self.parsetypepage,
                          meta={'item': item, 'machinecategory': cat_name})
            yield req
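(Passing meta through the Request constructor, as above, is equivalent to assigning req.meta['item'] after building the request; it is just more compact.)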

The problem was that the website is set up so that moving from a category page to a type page (and on to the later pages) happens by filtering the results that are shown. This means that if the requests are processed depth-first, all the way to the bottom of the chain, it works (i.e. pick a category, get all the types for that category, get all the machines of each type, then scrape each machine's page). But if a request for the next type page is processed before the spider has obtained the URLs for each machine of the first type, the URLs are no longer correct, so the spider reaches the wrong page and cannot extract the information for the next step.
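(As an aside, and not the fix I used: you can push Scrapy toward this strict depth-first ordering by serializing requests. With Scrapy's default LIFO scheduler queue, limiting concurrency to a single in-flight request approximates a depth-first crawl. The setting below exists in current Scrapy releases and may be named differently in the old scrapy.contrib-era version used in the question:)

    # settings.py -- a hedged sketch, not the solution described below
    CONCURRENT_REQUESTS = 1   # one request in flight at a time; combined with
                              # the default LIFO queue this gives a crawl that
                              # is close to depth-first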

To solve the problem I defined a category setup callback that is called only the first time and gets a list of all the categories, called categories. Then there is a category callback, called from the setup callback, which starts the crawl with a single category using categories.pop(). Once the spider has got to the bottom of the nested callbacks and scraped all the machines in the list, a request goes back to the category callback (dont_filter=True is required on that Request, since the category page URL has been visited before), where categories.pop() starts the process again with the next category in the list, until none are left. That way each category is dealt with fully before the next one is started, and it works.
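In outline, the approach might look something like the sketch below. The names (parsecatsetup, self.categories, self.catpage_url) are illustrative rather than my actual code, and a real spider would also need to know when a category is completely finished before looping back:

    # Hypothetical sketch of the one-category-at-a-time loop described above,
    # in the same old-Scrapy style as the question.
    def parsecatsetup(self, response):
        # Runs once: remember the category page and collect every category link.
        hxs = HtmlXPathSelector(response)
        self.catpage_url = response.url
        self.categories = hxs.select('//a[contains(@href, "filter=c:")]/@href').extract()
        return self.parsecatpage(response)

    def parsecatpage(self, response):
        # Kick off the crawl for just one category at a time.
        if self.categories:
            yield Request(urlparse.urljoin(self.catpage_url, self.categories.pop()),
                          callback=self.parsetypepage)

    def parsemachine(self, response):
        # ... fill in and yield the item as before ...
        # Bottom of the nested callbacks: once the last machine of this
        # category is done, hand control back to the category callback.
        # dont_filter=True is required because the category page URL has
        # already been visited and would otherwise be dropped by the
        # duplicate filter.
        yield Request(self.catpage_url, callback=self.parsecatpage,
                      dont_filter=True)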

Thanks for your last comment, it got me thinking along the right lines and led me to the solution!


Source: https://habr.com/ru/post/1401408/

