Problems using Scrapy with JavaScript __doPostBack

I am trying to automatically scrape results from a general search, but I am running into some problems. The URL has the form

http://www.website.com/search.aspx?keyword=#&&page=1&sort=Sorting 

When I click through the pages after visiting this first page, the URL changes slightly to

 http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page=2 

The problem is that if I try to visit the second link directly without visiting the first link beforehand, I am redirected to the first link. My current attempt is to define a long list of start_urls in Scrapy:

    class websiteSpider(BaseSpider):
        name = "website"
        allowed_domains = ["website.com"]
        baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
        start_urls = [(baseUrl + str(i)) for i in range(1, 1000)]

Currently, this code simply ends up visiting the first page again and again. I feel that this is probably easy, but I don’t quite understand how to get around this.

UPDATE: I made some progress exploring this and found that the site loads each page by sending a POST request to the previous page using __doPostBack(arg1, arg2). Now my question is: how exactly do I simulate this POST request in Scrapy? I know how to make a POST request, but not how to pass the arguments I want to it.
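For context, __doPostBack(eventTarget, eventArgument) on an ASP.NET WebForms page normally just copies its two arguments into the hidden __EVENTTARGET and __EVENTARGUMENT inputs and submits the main form, so simulating it amounts to POSTing those fields back to the page (the real page may also expect its other hidden fields such as __VIEWSTATE). A rough, untested sketch of hand-building that POST, assuming standard WebForms field names:

    # Sketch only: maps a __doPostBack(target, argument) call onto a raw
    # Scrapy POST request. Field names assume a standard ASP.NET WebForms
    # page; the actual site may require extra hidden fields (__VIEWSTATE,
    # __EVENTVALIDATION, ...) copied from a previously fetched page.
    import urllib
    from scrapy.http import Request

    def postback_request(url, target, argument):
        body = urllib.urlencode({
            '__EVENTTARGET': target,      # first __doPostBack argument
            '__EVENTARGUMENT': argument,  # second __doPostBack argument
        })
        return Request(url, method='POST', body=body,
                       headers={'Content-Type': 'application/x-www-form-urlencoded'})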

SECOND UPDATE: I have made great progress! I think... I looked through the examples and documentation and eventually came up with this version of what I think should do the trick:

    def start_requests(self):
        baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
        target = 'ctl00$empcnt$ucResults$pagination'
        requests = []
        for i in range(1, 5):
            url = baseUrl + str(i)
            argument = str(i + 1)
            data = {'__EVENTTARGET': target, '__EVENTARGUMENT': argument}
            currentPage = FormRequest(url, data)
            requests.append(currentPage)
        return requests

The idea is that this makes the POST request the same way the form does, so the results update accordingly. However, when I actually try to run this, I get the following traceback (shortened for brevity):

    2013-03-22 04:03:03-0400 [guru] ERROR: Unhandled error on engine.crawl()
        dfd.addCallbacks(request.callback or spider.parse, request.errback)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 280, in addCallbacks
        assert callable(callback)
    exceptions.AssertionError:
    2013-03-22 04:03:03-0400 [-] ERROR: Unhandled error in Deferred:
    2013-03-22 04:03:03-0400 [-] Unhandled Error
        Traceback (most recent call last):
        Failure: scrapy.exceptions.IgnoreRequest: Skipped (request already seen)

I have modified the question to focus on what this post has turned into.

Thoughts?

PS: When the other errors occur, Scrapy does not shut down cleanly, and I have to send SIGINT twice to actually stop it.

1 answer

FormRequest has no positional argument for formdata in its constructor:

    class FormRequest(Request):
        def __init__(self, *args, **kwargs):
            formdata = kwargs.pop('formdata', None)

so you really do have to pass it by keyword as formdata= (otherwise data is taken as the positional callback argument, which is what trips the assert callable(callback) in your traceback):

 requests.append(FormRequest(url, formdata=data)) 
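For completeness, here is a minimal sketch of the second-update start_requests with that fix applied (untested against the real site; dont_filter=True is an assumption added to sidestep the "request already seen" duplicate-filter error from the trace, should it still occur):

    # Sketch: the second-update code with formdata passed by keyword.
    from scrapy.http import FormRequest

    def start_requests(self):
        baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
        target = 'ctl00$empcnt$ucResults$pagination'
        requests = []
        for i in range(1, 5):
            url = baseUrl + str(i)
            argument = str(i + 1)
            data = {'__EVENTTARGET': target, '__EVENTARGUMENT': argument}
            # formdata must be a keyword argument; dont_filter bypasses
            # Scrapy's duplicate request filter.
            requests.append(FormRequest(url, formdata=data, dont_filter=True))
        return requests

If the page also requires the other hidden WebForms fields (__VIEWSTATE, __EVENTVALIDATION), FormRequest.from_response(response, formdata=...) can build the request from a previously fetched page and copy those fields across automatically.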
