ScrapyJS - How to properly wait for a page to load?

I use ScrapyJS and Splash to simulate clicking a form's submit button.

    def start_requests(self):
        script = """
        function main(splash)
            assert(splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"))
            assert(splash:go(splash.args.url))

            -- fill in the login form and click the submit button
            local js = [[
                var $j = jQuery.noConflict();
                $j('#USER').val('frankcastle');
                $j('#password').val('punisher');
                $j('.button-oblong-orange.button-orange a').click();
            ]]
            assert(splash:runjs(js))

            -- resume the Lua script once the document is ready again
            local resumeJs = [[
                function main(splash) {
                    var $j = jQuery.noConflict();
                    $j(document).ready(function(){
                        splash.resume();
                    })
                }
            ]]
            assert(splash:wait_for_resume(resumeJs))

            return { html = splash:html() }
        end
        """
        splash_meta = {'splash': {'endpoint': 'execute',
                                  'args': {'wait': 0.5, 'lua_source': script}}}
        for url in self.start_urls:
            yield scrapy.Request(url, self.after_login, meta=splash_meta)

    def after_login(self, response):
        print response.body
        return

After running splash:runjs(js), I resort to splash:wait(5) to get the result (I also tried splash:wait_for_resume, as above). A fixed wait may not always be enough because of network latency, so is there a better way?
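For reference, splash:wait_for_resume expects a JavaScript snippet that defines function main(splash) and calls splash.resume() to hand control back to Lua. A minimal sketch that polls for a post-login element instead of relying on document.ready (the #account-menu selector and the 10-second timeout are assumptions, not from the question):

    local resumeJs = [[
        function main(splash) {
            // poll every 100 ms until a post-login element exists,
            // then resume the Lua script
            var poll = setInterval(function () {
                if (document.querySelector('#account-menu') !== null) {
                    clearInterval(poll);
                    splash.resume();
                }
            }, 100);
        }
    ]]
    -- the second argument makes wait_for_resume error out after 10 seconds
    -- instead of hanging forever
    assert(splash:wait_for_resume(resumeJs, 10))

This keys the wait on an observable page change rather than a fixed delay, so it is robust to network latency.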

+5
3 answers

It turns out that the only way is to use splash:wait(), but do it in a loop and check for the presence of some element (for example, a footer).

    def start_requests(self):
        script = """
        function main(splash)
            assert(splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"))
            assert(splash:go(splash.args.url))

            local js = [[
                var $j = jQuery.noConflict();
                $j('#USER').val('frankcastle');
                $j('#password').val('punisher');
                $j('.button-oblong-orange.button-orange a').click();
                $j('body').empty()  // clear body, otherwise the wait_for footer check will always be true
            ]]
            assert(splash:runjs(js))

            -- poll until the given condition becomes true
            function wait_for(splash, condition)
                while not condition() do
                    splash:wait(0.05)
                end
            end

            wait_for(splash, function()
                return splash:evaljs("document.querySelector('#footer') != null")
            end)

            return { html = splash:html() }
        end
        """
        splash_meta = {'splash': {'endpoint': 'execute',
                                  'args': {'wait': 0.5, 'lua_source': script}}}
        for url in self.start_urls:
            yield scrapy.Request(url, self.after_login, meta=splash_meta)
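One caveat with the loop above (my note, not part of the original answer): if the footer never appears, it spins until Splash's global timeout kills the whole render. A bounded variant is a small change; the 10-second budget here is an arbitrary assumption:

    -- returns true if the condition became true, false if max_wait elapsed
    function wait_for(splash, condition, max_wait)
        local elapsed = 0
        while not condition() do
            if elapsed >= max_wait then
                return false
            end
            splash:wait(0.05)
            elapsed = elapsed + 0.05
        end
        return true
    end

    -- fail loudly after ~10 seconds instead of hanging
    assert(wait_for(splash, function()
        return splash:evaljs("document.querySelector('#footer') != null")
    end, 10))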
+4

So, I haven't played with this yet (I only picked up Lua and had my first successful attempts with Splash today).

But if you do something like this:

    recheck = True
    html = splash:html()
    splash:wait(0.5)
    while recheck:
        splash:wait(0.5)
        html2 = splash:html()
        if html != html2:
            html = html2      # page still changing, keep checking
        else:
            recheck = False   # two identical snapshots, assume it settled
    return { html = splash:html(), }

We are going to use a similar approach for endless-scroll pages that populate list items in response to scrolls (or page-downs); a Lua version of this idea is sketched below.

Sorry for not knowing the Lua/Splash syntax.
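A rough Lua translation of the idea (my sketch; the original answer is pseudocode): snapshot the DOM repeatedly and stop once two consecutive snapshots match, i.e. the page has stopped changing.

    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(0.5)

        local html = splash:html()
        local recheck = true
        while recheck do
            -- for endless-scroll pages, trigger more content each pass, e.g.:
            -- splash:runjs("window.scrollTo(0, document.body.scrollHeight);")
            splash:wait(0.5)
            local html2 = splash:html()
            if html ~= html2 then
                html = html2      -- still changing, keep waiting
            else
                recheck = false   -- two identical snapshots, assume it settled
            end
        end

        return { html = splash:html() }
    end

Note this trades precision for generality: it needs no knowledge of the page, but a slow trickle of DOM changes keeps it looping.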

0

There is a slightly better way to check this, but you still need a wait loop. The idea is to register a callback with splash:on_response, which is invoked whenever the page receives a response. Note that the callback runs asynchronously, so the main loop still has to wait for all page changes to finish; hence the wait loop (for example, the one given by @Krishnaraj).

Below is an example that presses a button with id button_id 10 times to download additional content.

    function main(splash)
        assert(splash:go(splash.args.url))

        -- poll until the given condition becomes true
        function wait_for(splash, condition)
            while not condition() do
                splash:wait(0.2)
            end
        end

        local clicks = 0

        -- each response to a click triggers the next click, up to 10
        splash:on_response(function(res)
            clicks = clicks + 1
            if clicks < 10 then
                assert(splash:runjs("document.getElementById(\"button_id\").click();"))
            end
        end)

        assert(splash:runjs("document.getElementById(\"button_id\").click();"))
        wait_for(splash, function() return clicks >= 10 end)

        return splash:html()
    end
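One thing to watch (my note, not from the answer): splash:on_response fires for every resource the page loads, including images, stylesheets and trackers, so counting raw responses over-counts button presses. A sketch that counts only responses from the endpoint the button actually hits; the "/load_more" substring is a hypothetical placeholder:

    local clicks = 0
    splash:on_response(function(res)
        -- only count the XHR triggered by the button, not images/CSS/etc.
        if string.find(res.url, "/load_more", 1, true) then
            clicks = clicks + 1
            if clicks < 10 then
                assert(splash:runjs("document.getElementById(\"button_id\").click();"))
            end
        end
    end)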
0

Source: https://habr.com/ru/post/1246398/
