Scraping __doPostBack links where the URL is hidden

I am trying to scrape search results from a website using the __doPostBack function. The page displays 10 results per search query; to see more results, you have to click a button that triggers the JavaScript function __doPostBack. After some research I realized that the resulting POST request behaves exactly like a form submission, and that you can simply use Scrapy's FormRequest to fill out this form. I used the following thread:

Troubles using scrapy with javascript __doPostBack method

to write the script below.
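The core of that approach: __doPostBack(eventTarget, eventArgument) just fills two hidden form inputs and submits the surrounding ASP.NET form, so the same POST can be rebuilt explicitly. A minimal illustrative sketch (the values are the ones this page uses; the search term in the URL is elided):

    from scrapy.http import FormRequest

    # Equivalent of clicking a link that runs __doPostBack('GridView1', 'Page$2'):
    req = FormRequest(
        'http://digital.ahram.org.eg/sresult.aspx?srch=...&archid=1',  # '...' = the search term
        formdata={
            '__EVENTTARGET': u'GridView1',   # first argument of __doPostBack
            '__EVENTARGUMENT': u'Page$2',    # second argument of __doPostBack
        },
    )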

    # -*- coding: utf-8 -*-
    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.http import FormRequest
    from scrapy.http import Request
    from scrapy.selector import Selector
    from ahram.items import AhramItem
    import re

    class MySpider(CrawlSpider):
        name = u"el_ahram2"

        def start_requests(self):
            search_term = u'اقتصاد'
            baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
            requests = []
            for i in range(1, 4):  # crawl the first 3 pages as a test
                argument = u"'Page$" + str(i + 1) + u"'"
                data = {'__EVENTTARGET': u"'GridView1'", '__EVENTARGUMENT': argument}
                currentPage = FormRequest(baseUrl, formdata=data, callback=self.fetch_articles)
                requests.append(currentPage)
            return requests

        def fetch_articles(self, response):
            sel = Selector(response)
            for ref in sel.xpath("//a[contains(@href,'checkpart.aspx?Serial=')]/@href").extract():
                yield Request('http://digital.ahram.org.eg/' + ref, callback=self.parse_items)

        def parse_items(self, response):
            sel = Selector(response)
            the_title = ' '.join(sel.xpath("//title/text()").extract()).replace('\n', '').replace('\r', '').replace('\t', '')
            the_authors = '---'.join(sel.xpath("//*[contains(@id,'editorsdatalst_HyperLink')]//text()").extract())  # '*' means 'any element'
            the_text = ' '.join(sel.xpath("//span[@id='TextBox2']/text()").extract())
            the_month_year = ' '.join(sel.xpath("string(//span[@id = 'Label1'])").extract())
            the_day = ' '.join(sel.xpath("string(//span[@id = 'Label2'])").extract())
            item = AhramItem()
            item["Authors"] = the_authors
            item["Title"] = the_title
            item["MonthYear"] = the_month_year
            item["Day"] = the_day
            item['Text'] = the_text
            return item

My problem is that 'fetch_articles' is never called:

    2014-05-27 12:19:12+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
    2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
    2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
    2014-05-27 12:19:13+0200 [el_ahram2] INFO: Closing spider (finished)

After a few days of searching I feel completely stuck. I am new to Python, so the error may be trivial; if it is not, this thread may be useful to many people. Thank you in advance for your help.

2 answers

Your code is fine, and fetch_articles does run. You can verify this by adding a print statement.

However, the website requires you to validate your POST requests: the request body must include __EVENTVALIDATION and __VIEWSTATE to prove you are responding to their form. To get these values, you first need to make a GET request and extract the fields from the form. If you don't, you get an error page instead, which contains no links with "checkpart.aspx?Serial=", so your for loop never executes.

Here is how I set up start_requests, with fetch_search now doing the work that start_requests did before:

    class MySpider(CrawlSpider):
        name = u"el_ahram2"

        def start_requests(self):
            search_term = u'اقتصاد'
            baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
            SearchPage = Request(baseUrl, callback=self.fetch_search)
            return [SearchPage]

        def fetch_search(self, response):
            sel = Selector(response)
            search_term = u'اقتصاد'
            baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
            viewstate = sel.xpath("//input[@id='__VIEWSTATE']/@value").extract().pop()
            eventvalidation = sel.xpath("//input[@id='__EVENTVALIDATION']/@value").extract().pop()
            for i in range(1, 4):  # crawl the first 3 pages as a test
                argument = u"'Page$" + str(i + 1) + u"'"
                data = {'__EVENTTARGET': u"'GridView1'",
                        '__EVENTARGUMENT': argument,
                        '__VIEWSTATE': viewstate,
                        '__EVENTVALIDATION': eventvalidation}
                currentPage = FormRequest(baseUrl, formdata=data, callback=self.fetch_articles)
                yield currentPage

        ...
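As an aside, Scrapy's FormRequest.from_response can collect __VIEWSTATE, __EVENTVALIDATION, and any other hidden inputs from the fetched form automatically, which makes the manual XPath extraction unnecessary. A sketch of the same pagination loop written that way (untested against this particular site):

    from scrapy.http import FormRequest  # already imported in the spider above

    def fetch_search(self, response):
        for i in range(1, 4):  # crawl the first 3 pages as a test
            yield FormRequest.from_response(
                response,
                formdata={'__EVENTTARGET': u'GridView1',
                          '__EVENTARGUMENT': u'Page$' + str(i + 1)},
                dont_click=True,  # post the fields as-is, without simulating a submit-button click
                callback=self.fetch_articles,
            )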
A debugging suggestion: print the response body inside fetch_articles:

    def fetch_articles(self, response):
        sel = Selector(response)
        print response.body  # you can write this to a file and grep it
        for ref in sel.xpath("//a[contains(@href,'checkpart.aspx?Serial=')]/@href").extract():
            yield Request('http://digital.ahram.org.eg/' + ref, callback=self.parse_items)

I could not find the "checkpart.aspx?Serial=" links you are looking for.

This may not solve your problem, but I am posting it as an answer rather than a comment so that the code is formatted.
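For what it's worth, the "write to a file and grep" idea from the snippet above could look like the sketch below (the dump filename is arbitrary); Scrapy's open_in_browser helper is another way to see exactly what the spider received:

    from scrapy.utils.response import open_in_browser

    def fetch_articles(self, response):
        # Dump the raw body so it can be grepped for 'checkpart.aspx?Serial=':
        with open('dump.html', 'wb') as f:  # arbitrary filename
            f.write(response.body)
        open_in_browser(response)  # or inspect the received page in a browser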


Source: https://habr.com/ru/post/969889/

