I am trying to scrape search results from a website that uses the __doPostBack function. The page displays 10 results per search query; to see more results, you have to click a button that triggers __doPostBack via JavaScript. After some research, I realized that the resulting POST request behaves just like an ordinary form submission, so it should be possible to reproduce it with Scrapy's FormRequest. I used the following thread:
Failures using scrapy using javascript __doPostBack method
to write the following script.
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import FormRequest
from scrapy.http import Request
from scrapy.selector import Selector
from ahram.items import AhramItem
import re


class MySpider(CrawlSpider):
    name = u"el_ahram2"

    def start_requests(self):
        search_term = u'اقتصاد'
        baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
        requests = []
        for i in range(1, 4):  # crawl the first 3 pages as a test
            argument = u"'Page$" + str(i + 1) + u"'"
            data = {'__EVENTTARGET': u"'GridView1'", '__EVENTARGUMENT': argument}
            currentPage = FormRequest(baseUrl, formdata=data, callback=self.fetch_articles)
            requests.append(currentPage)
        return requests

    def fetch_articles(self, response):
        sel = Selector(response)
        for ref in sel.xpath("//a[contains(@href,'checkpart.aspx?Serial=')]/@href").extract():
            yield Request('http://digital.ahram.org.eg/' + ref, callback=self.parse_items)

    def parse_items(self, response):
        sel = Selector(response)
        the_title = ' '.join(sel.xpath("//title/text()").extract()).replace('\n', '').replace('\r', '').replace('\t', '')
        the_authors = '---'.join(sel.xpath("//*[contains(@id,'editorsdatalst_HyperLink')]//text()").extract())  # * means 'any element'
        the_text = ' '.join(sel.xpath("//span[@id='TextBox2']/text()").extract())
        the_month_year = ' '.join(sel.xpath("string(//span[@id = 'Label1'])").extract())
        the_day = ' '.join(sel.xpath("string(//span[@id = 'Label2'])").extract())
        item = AhramItem()
        item["Authors"] = the_authors
        item["Title"] = the_title
        item["MonthYear"] = the_month_year
        item["Day"] = the_day
        item['Text'] = the_text
        return item
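For reference, this is the pattern I understood from that thread: fetch the search page with a normal GET first, then let FormRequest.from_response copy the hidden ASP.NET fields (__VIEWSTATE and friends) and only override the postback target. This is only a sketch of what I think the page expects; the field names, and the idea that those hidden fields are required, are my assumptions and not something I have verified for this site.

from scrapy.http import FormRequest

# Sketch only: the from_response variant described in the linked thread.
# It assumes the search page contains a standard ASP.NET <form> whose hidden
# fields (__VIEWSTATE, __EVENTVALIDATION, ...) must accompany the postback.
def load_next_page(self, response):
    return FormRequest.from_response(
        response,
        formdata={
            '__EVENTTARGET': 'GridView1',   # plain value, no extra quotes
            '__EVENTARGUMENT': 'Page$2',    # second page of results
        },
        dont_click=True,                    # submit the form without simulating a button click
        callback=self.fetch_articles,
    )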
My problem is that 'fetch_articles' is never called:
2014-05-27 12:19:12+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] INFO: Closing spider (finished)
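In case it helps, this is how I have been trying to confirm whether fetch_articles runs at all and whether the XPath matches anything. It is just Scrapy's logging plus inspect_response from the debugging docs, added as a temporary check rather than as part of the real spider:

from scrapy.shell import inspect_response

def fetch_articles(self, response):
    self.log("fetch_articles called: %d bytes received" % len(response.body))
    sel = Selector(response)
    refs = sel.xpath("//a[contains(@href,'checkpart.aspx?Serial=')]/@href").extract()
    self.log("found %d article links" % len(refs))
    if not refs:
        # drop into an interactive shell to look at the HTML Scrapy actually received
        inspect_response(response, self)
    for ref in refs:
        yield Request('http://digital.ahram.org.eg/' + ref, callback=self.parse_items)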
After a few days of searching, I feel completely stuck. I am new to Python, so the error may be trivial; if it is not, this thread may be useful to many people. Thanks in advance for your help.