Scrapy + Splash + ScrapyJS

Question

I am using Splash 2.0.2 + Scrapy 1.0.5 + ScrapyJS 0.1.1, and I still cannot get content that JavaScript reveals after a click to show up in the response. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf

I am still getting a page without a displayed phone number:

    import scrapy


    class OlxSpider(scrapy.Spider):
        name = "olx"
        rotate_user_agent = True
        allowed_domains = ["olx.pt"]
        start_urls = [
            "https://olx.pt/imoveis/"
        ]

        def parse(self, response):
            # Lua script for Splash: load the page, click the phone span, return the HTML
            script = """
            function main(splash)
                splash:go(splash.args.url)
                splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
                splash:wait(0.5)
                return splash:html()
            end
            """
            # Follow every listing link through Splash's "execute" endpoint
            for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
                url = response.urljoin(href.extract())
                yield scrapy.Request(url, callback=self.parse_house_contents, meta={
                    'splash': {
                        'args': {'lua_source': script},
                        'endpoint': 'execute',
                    }
                })

            # Follow pagination with plain Scrapy requests
            for next_page in response.css('.pager .br3.brc8::attr(href)'):
                url = response.urljoin(next_page.extract())
                yield scrapy.Request(url, self.parse)

        def parse_house_contents(self, response):
            import ipdb; ipdb.set_trace()

How can I make this work?

+5

python scrapy screen-scraping scrapy-spider

psychok7 Mar 03 '16 at 18:57


2 answers

Add

    splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")

to the Lua script and it will work:

    function main(splash)
        splash:go(splash.args.url)
        splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
        splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
        splash:wait(0.5)
        return splash:html()
    end

.click() is a jQuery function: https://api.jquery.com/click/
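For reference, here is a minimal sketch of how a Lua script can be attached to a request, using the same meta layout as in the question. The helper name splash_meta and the short placeholder script are assumptions made for illustration; they are not part of ScrapyJS.

```python
# Placeholder Lua script (an assumption for illustration); it is sent to
# Splash as a string and is never executed by Python itself.
LUA_SCRIPT = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(0.5)
    return splash:html()
end
"""


def splash_meta(lua_source, endpoint="execute"):
    """Hypothetical helper: build the request meta dict in the layout
    used in the question ('splash' -> 'args' / 'endpoint')."""
    return {
        "splash": {
            "args": {"lua_source": lua_source},
            "endpoint": endpoint,
        }
    }


meta = splash_meta(LUA_SCRIPT)
```

In a spider this would be passed as scrapy.Request(url, callback=..., meta=splash_meta(script)), which keeps the Lua source out of the parse method.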

+3

marvin Mar 05 '16 at 16:07


alecxe · Accepted Answer · 2016-03-03T19:34:05+0000

You can avoid Splash altogether here and make the corresponding GET request for the phone number yourself. A working spider:

    import json
    import re

    import scrapy


    class OlxSpider(scrapy.Spider):
        name = "olx"
        rotate_user_agent = True
        allowed_domains = ["olx.pt"]
        start_urls = [
            "https://olx.pt/imoveis/"
        ]

        def parse(self, response):
            # Follow every listing link on the current results page
            for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
                url = response.urljoin(href.extract())
                yield scrapy.Request(url, callback=self.parse_house_contents)

            # Follow pagination
            for next_page in response.css('.pager .br3.brc8::attr(href)'):
                url = response.urljoin(next_page.extract())
                yield scrapy.Request(url, self.parse)

        def parse_house_contents(self, response):
            # The listing ID is the part of the URL between "ID" and the next dot
            property_id = re.search(r"ID(\w+)\.", response.url).group(1)

            # Request the phone number from the site's AJAX endpoint
            phone_url = "https://olx.pt/ajax/misc/contact/phone/%s/" % property_id
            yield scrapy.Request(phone_url, callback=self.parse_phone)

        def parse_phone(self, response):
            # The endpoint answers with JSON; the number is under "value"
            phone_number = json.loads(response.body)["value"]
            print(phone_number)
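To check the ID extraction without running the spider, the two steps can be tried on the example URL from the question. This is only a sketch: the function names are made up for illustration, and the JSON body is a made-up placeholder (not real site output) that assumes the "value" key used by the spider.

```python
import json
import re


def extract_property_id(url):
    """Return the text between "ID" and the next dot, or None if absent."""
    match = re.search(r"ID(\w+)\.", url)
    return match.group(1) if match else None


def build_phone_url(property_id):
    """Build the AJAX phone URL in the same format as the spider."""
    return "https://olx.pt/ajax/misc/contact/phone/%s/" % property_id


ad_url = ("https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-"
          "fechada-para-arrumos-IDyTzAT.html#c49d3d94cf")
property_id = extract_property_id(ad_url)
phone_url = build_phone_url(property_id)

# Made-up response body, only to show how the "value" key is read.
sample_body = b'{"value": "000 000 000"}'
phone_number = json.loads(sample_body.decode("utf-8"))["value"]

print(property_id, phone_url, phone_number)
```

Returning None when the pattern does not match avoids the AttributeError that a bare .group(1) would raise on a URL without an ID.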

If other parts of this site are loaded dynamically as well, check whether Splash is really enough; if it is not, look into browser automation with Selenium.
