I made a web scraper using the Scrapy framework to get live ticket data from this website. I was able to successfully scrape the data from the elements inside each ticket listing on the page, except for the price, which can only be reached by clicking the "tickets" button, going to the tickets page, and scraping the price from the ticket there.
After extensive Googling, I found that ScrapyJS (which is built on Splash) can be used with Scrapy to interact with JavaScript on the page (for example, the button that needs to be clicked). I saw some basic examples of how Splash is used to interact with JavaScript, but none of them showed how to integrate Splash with Scrapy (not even in the documentation).
I follow the pattern of using item loaders to store the scraped fields in the parse method and then yielding a request that goes to another link, whose HTML is parsed by a second parse method
(e.g. yield scrapy.Request(next_link, callback=self.parse_price)),
but the code for this would change a bit now that I am using ScrapyJS. To enable ScrapyJS, I was thinking of using a Lua script like this one:
function main(splash)
    splash:go("http://example.com")
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title=title}
end
from this site, but since a Lua function like this cannot be written directly into the Python source as code, how / where would I include it in the program so that it clicks the button, goes to the next page, and parses the HTML? (Two rough sketches of what I have in mind follow the spider code below.) I'm obviously very new to web scraping, so any help at all would be greatly appreciated. The code for the spider is below:
concert_ticket_spider.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from concert_comparator.items import ComparatorItem

bandname = raw_input("Enter a bandname \n")
vs_url = "http://www.vividseats.com/concerts/" + bandname + "-tickets.html"

class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
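Here is the first sketch. From the ScrapyJS README, my understanding is that the middleware gets enabled in settings.py, and the Lua script never runs inside Python at all: it is held in an ordinary Python string and shipped to the Splash HTTP service, which executes it and sends back the rendered page. Is something like this the right wiring (assuming Splash is running locally on the default port 8050; the setting and middleware names are the ones from the README, which I have not verified beyond that)?

# settings.py -- wiring ScrapyJS into the project (setting names from the ScrapyJS README)
SPLASH_URL = 'http://localhost:8050'                 # address of the running Splash instance
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,                # sends splash-tagged requests through Splash
}
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'  # so Splash requests are deduplicated correctly

# concert_ticket_spider.py -- the Lua function lives in a plain Python string
script = """
function main(splash)
    splash:go(splash.args.url)   -- splash.args.url is the URL of the scrapy.Request
    splash:wait(0.5)             -- give the page's JavaScript time to run
    return splash:html()         -- the rendered HTML becomes the Scrapy response body
end
"""

So if I understand it correctly, the "function" is just data from Python's point of view, and Splash does the clicking and rendering out of process.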
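And here is the second sketch: the item-loader pattern I described above, split into two callbacks, with the partially filled item carried along in meta and the second request tagged for Splash. The XPaths, the ticket selector, and the field names ('eventName', 'price') are placeholders I made up, not the site's real ones:

# methods inside MySpider (also needs: import scrapy)
def parse(self, response):
    for ticket in response.xpath('//div[@class="ticket"]'):        # placeholder selector
        loader = XPathItemLoader(ComparatorItem(), selector=ticket)
        loader.add_xpath('eventName', './/h3/text()')              # placeholder field/XPath
        item = loader.load_item()
        # the link behind the "tickets" button
        next_link = ticket.xpath('.//a/@href').extract()[0]
        yield scrapy.Request(next_link, callback=self.parse_price, meta={
            'item': item,                                          # carry the partial item along
            'splash': {
                'endpoint': 'execute',                             # run the Lua script above
                'args': {'lua_source': script},
            },
        })

def parse_price(self, response):
    item = response.meta['item']                                   # the partially filled item
    item['price'] = response.xpath('//span[@class="price"]/text()').extract()[0]  # placeholder
    yield item

Does that second request need anything else for the middleware to pick it up, or is the 'splash' key in meta enough?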