Adding a wait for an element when executing SplashRequest in Python Scrapy

I am trying to scrape several dynamic sites using Splash for Scrapy in Python. However, I see that Splash fails to wait for the full page to load in certain cases. As a brute-force workaround I added a long wait time (for example, 5 seconds in the snippet below), but this is extremely inefficient and still misses some data (sometimes loading takes longer than 5 seconds). Is there a wait-for-element condition that can be applied through these requests?

    yield SplashRequest(
        url,
        self.parse,
        args={
            'wait': 5,
            'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36",
        },
    )
3 answers

Yes, you can write a Lua script for this. Something like this:

    function main(splash)
      splash:set_user_agent(splash.args.ua)
      assert(splash:go(splash.args.url))
      -- requires Splash 2.3
      while not splash:select('.my-element') do
        splash:wait(0.1)
      end
      return {html=splash:html()}
    end

Prior to Splash 2.3, you can use splash:evaljs('!document.querySelector(".my-element")') instead of not splash:select('.my-element').
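The wait loop in this script is plain polling: check for the element, sleep briefly, repeat. A minimal sketch of the same pattern in plain Python, where element_present and sleep are hypothetical stand-ins for splash:select and splash:wait:

```python
def wait_for_element(element_present, sleep, interval=0.1):
    """Poll element_present() until it returns something truthy.

    element_present stands in for splash:select('.my-element');
    sleep stands in for splash:wait(interval).
    Returns the number of sleeps performed before the element appeared.
    """
    attempts = 0
    while not element_present():
        sleep(interval)
        attempts += 1
    return attempts

# Simulate an element that appears on the third check.
checks = iter([None, None, "<div class='my-element'>"])
waits = []
attempts = wait_for_element(lambda: next(checks), waits.append)
```

Note that, unlike the timeout-bounded variants further down, this loop runs forever if the element never appears.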

Save this script to a variable (lua_script = """...."""). Then you can send a request like this:

    yield SplashRequest(
        url,
        self.parse,
        endpoint='execute',
        args={
            'lua_source': lua_script,
            'ua': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36",
        },
    )

See the Scripting Tutorial and Reference for more information on how to write Splash Lua scripts.
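Outside Scrapy you can also POST the script to Splash's HTTP API yourself. A minimal sketch of the request body, assuming a Splash instance at the default http://localhost:8050 (build_execute_payload is a hypothetical helper; every key besides lua_source is exposed to the script as splash.args):

```python
import json

def build_execute_payload(lua_script, url, user_agent):
    # The /execute endpoint takes a JSON body: 'lua_source' is the script,
    # and the remaining keys become splash.args.<key> inside it.
    return {
        "lua_source": lua_script,
        "url": url,
        "ua": user_agent,
    }

lua_script = """
function main(splash)
  splash:set_user_agent(splash.args.ua)
  assert(splash:go(splash.args.url))
  return {html=splash:html()}
end
"""

payload = build_execute_payload(lua_script, "http://example.com", "Mozilla/5.0")
body = json.dumps(payload)  # POST this to http://localhost:8050/execute
```

This is essentially what scrapy-splash does for you when you set endpoint='execute'.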


I had a similar requirement, plus timeouts. My solution is a small modification of the script above:

    function wait_css(splash, css, maxwait)
      if maxwait == nil then
        maxwait = 10  -- default maxwait if not given
      end
      local i = 0
      while not splash:select(css) do
        if i == maxwait then
          break  -- times out at maxwait secs
        end
        i = i + 1
        splash:wait(1)  -- each loop has duration 1 sec
      end
    end
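The same timeout logic rendered as plain Python; as before, element_present and sleep are hypothetical stand-ins for splash:select(css) and splash:wait(1):

```python
def wait_css(element_present, sleep, maxwait=10):
    """Poll once per second until the element appears or maxwait seconds pass.

    Returns True if the element appeared, False if the wait timed out.
    """
    elapsed = 0
    while not element_present():
        if elapsed == maxwait:
            return False  # timed out after maxwait seconds
        elapsed += 1
        sleep(1)  # each loop iteration lasts one second
    return True

# Element never appears: the loop gives up after maxwait seconds.
slept = []
timed_out = wait_css(lambda: False, slept.append, maxwait=3)

# Element already present: returns immediately without sleeping.
found = wait_css(lambda: True, slept.append, maxwait=3)
```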

You can use a Lua script with JavaScript and splash:wait_for_resume (see the documentation).

    function main(splash, args)
      splash.resource_timeout = 60
      assert(splash:go(splash.args.url))
      assert(splash:wait(1))
      splash.scroll_position = {y=500}
      result, error = splash:wait_for_resume([[
        function main(splash) {
          var checkExist = setInterval(function() {
            if (document.querySelector(".css-selector").innerText) {
              clearInterval(checkExist);
              splash.resume();
            }
          }, 1000);
        }
      ]], 30)
      assert(splash:wait(0.5))
      return splash:html()
    end

If you use Splash without the scrapy-splash plugin, note that the way the URL reaches splash:go via splash.args.url will be different.
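The setInterval polling above, capped by the 30-second wait_for_resume limit, amounts to the following loop. A sketch in plain Python, with text_of a hypothetical stand-in for reading document.querySelector(".css-selector").innerText:

```python
def wait_for_text(text_of, sleep, interval=1, timeout=30):
    """Check every `interval` seconds until text_of() returns non-empty text;
    give up once `timeout` seconds have been spent waiting."""
    waited = 0
    while waited < timeout:
        text = text_of()
        if text:
            return text  # splash.resume() fires here in the real script
        sleep(interval)
        waited += interval
    return None  # the element's text never appeared within the timeout

# The text becomes available after two empty checks.
results = iter(["", "", "loaded"])
ticks = []
text = wait_for_text(lambda: next(results), ticks.append)
```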


Source: https://habr.com/ru/post/1261084/

