I made a web scraper using the Scrapy framework to get live ticket data from this website. I was able to successfully scrape the data from the elements inside each ticket listing on the page, except for the price, which can only be reached by clicking the "tickets" button, going to the tickets page, and scraping the price from the ticket there.
After extensive Googling, I found that ScrapyJS (which is built on Splash) can be used with Scrapy to interact with JavaScript on the page (for example, the button that needs to be clicked). I saw some basic examples of how Splash is used to interact with JavaScript, but none of them showed how to integrate Splash with Scrapy (not even in the documentation).
I follow the pattern of using item loaders to store the scraped fields in the parse method and then yielding a request that goes to another link, whose HTML is parsed by a second parse method
(e.g. yield scrapy.Request(next_link, callback=self.parse_price)),
but the code for this would change a bit now that I am using ScrapyJS. To enable ScrapyJS, I was thinking of using a Lua script like this one:
function main(splash)
    splash:go("http://example.com")
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title=title}
end
from this site, but since a Lua function like this cannot be written directly into the Python source as code, how / where would I include it in the program so that it clicks the button, goes to the next page, and parses the HTML? (Two rough sketches of what I have in mind follow the spider code below.) I'm obviously very new to web scraping, so any help at all would be greatly appreciated. The code for the spider is below:
concert_ticket_spider.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from concert_comparator.items import ComparatorItem

bandname = raw_input("Enter a bandname \n")
vs_url = "http://www.vividseats.com/concerts/" + bandname + "-tickets.html"

class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
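Here is the first sketch. From the ScrapyJS README, my understanding is that the middleware gets enabled in settings.py, and the Lua script never runs inside Python at all: it is held in an ordinary Python string and shipped to the Splash HTTP service, which executes it and sends back the rendered page. Is something like this the right wiring (assuming Splash is running locally on the default port 8050; the setting and middleware names are the ones from the README, which I have not verified beyond that)?

# settings.py -- wiring ScrapyJS into the project (setting names from the ScrapyJS README)
SPLASH_URL = 'http://localhost:8050'                 # address of the running Splash instance
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,                # sends splash-tagged requests through Splash
}
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'  # so Splash requests are deduplicated correctly

# concert_ticket_spider.py -- the Lua function lives in a plain Python string
script = """
function main(splash)
    splash:go(splash.args.url)   -- splash.args.url is the URL of the scrapy.Request
    splash:wait(0.5)             -- give the page's JavaScript time to run
    return splash:html()         -- the rendered HTML becomes the Scrapy response body
end
"""

So if I understand it correctly, the "function" is just data from Python's point of view, and Splash does the clicking and rendering out of process.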
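And here is the second sketch: the item-loader pattern I described above, split into two callbacks, with the partially filled item carried along in meta and the second request tagged for Splash. The XPaths, the ticket selector, and the field names ('eventName', 'price') are placeholders I made up, not the site's real ones:

# methods inside MySpider (also needs: import scrapy)
def parse(self, response):
    for ticket in response.xpath('//div[@class="ticket"]'):        # placeholder selector
        loader = XPathItemLoader(ComparatorItem(), selector=ticket)
        loader.add_xpath('eventName', './/h3/text()')              # placeholder field/XPath
        item = loader.load_item()
        # the link behind the "tickets" button
        next_link = ticket.xpath('.//a/@href').extract()[0]
        yield scrapy.Request(next_link, callback=self.parse_price, meta={
            'item': item,                                          # carry the partial item along
            'splash': {
                'endpoint': 'execute',                             # run the Lua script above
                'args': {'lua_source': script},
            },
        })

def parse_price(self, response):
    item = response.meta['item']                                   # the partially filled item
    item['price'] = response.xpath('//span[@class="price"]/text()').extract()[0]  # placeholder
    yield item

Does that second request need anything else for the middleware to pick it up, or is the 'splash' key in meta enough?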