Scraping with Nokogiri and Ruby before and after JavaScript changes the value

I have a program that scrapes a value from https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj

My current code is:

    doc = Nokogiri::HTML(open(source_url))
    puts doc.css('span.indexDate').text
    date = doc.css('span.indexDate').text
    date = Date.parse(date)
    puts date
    values = doc.css('table#CdsIndexTable td.col2 span')
    puts values

This scrapes the date and the values of the second column from the "CDS Indexes" table correctly, which is good. Now I want to scrape the same values from the "Bond Indexes" table, and this is where I ran into a problem.

I see that a JavaScript function switches the table without reloading the page and without changing the page URL. The only difference between the two tables is their IDs, which is as it should be. But unfortunately, when I try:

    values = doc.css('table#BondIndexTable')
    puts values

I get nothing from the "Bond Indexes" table. But I do get the values from the "CDS Indexes" table if I use:

    values = doc.css('table#CdsIndexTable')
    puts values

How can I get the values from both tables?

+4
3 answers

If you do not want to use PhantomJS, you can open the network inspector in the Firefox or Chrome developer tools, and you will see that the HTML for the table is returned by a POST request that the page's JavaScript sends to the server.

Then, instead of opening the original page URL with Nokogiri, you make that same request from your Ruby script and parse the response. It looks like it's just JSON data with HTML embedded in it. You can extract the HTML and feed it to Nokogiri.

This requires a little extra detective work, but I have used this method many times to scrape JavaScript-driven web pages. It works fine for most simple tasks, but it takes some digging into the inner workings of the page and its network traffic.
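As a rough sketch of replaying such a request from Ruby, the following uses only the standard library's Net::HTTP. The helper names `build_data_request` and `fetch_data`, the optional `cookie` and `form` parameters, and the `Accept` header are assumptions made for illustration; the headers and form fields a given endpoint actually expects have to be read off the network inspector.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Build a POST request for a data endpoint found in the network inspector.
# `cookie` and `form` are optional; pass them only if the endpoint needs them.
# (Hypothetical helper, not part of any library.)
def build_data_request(url, cookie: nil, form: {})
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri)
  request['Accept'] = 'application/json'
  request['Cookie'] = cookie if cookie
  request.set_form_data(form) unless form.empty?
  request
end

# Send the request and return the parsed JSON body.
# This performs network I/O, so it is only defined here, not called.
def fetch_data(url, cookie: nil, form: {})
  uri = URI.parse(url)
  request = build_data_request(url, cookie: cookie, form: form)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  raise "Request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  JSON.parse(response.body)
end
```

Calling `fetch_data(endpoint_url, cookie: cookie)` would then return a Ruby hash; whether the embedded HTML lives under a particular key depends on the response you observe in the inspector.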

Here are the URLs that return the JSON data for the JavaScript request:

Bonds:
https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ

CDS:
https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=cds&ClientCode=WSJ

Here's a quick and dirty solution to give you the idea. It grabs the cookie from the start page and sends it with the request for the JSON data, then parses the JSON and passes the extracted HTML to Nokogiri:

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'
    require 'json'

    # Open the initial page to grab the cookie from it
    # (URI.open is required on Ruby 3.0+, where plain open no longer accepts URLs)
    p1 = URI.open('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj')

    # Save the cookie (keep only the name=value part)
    cookie = p1.meta['set-cookie'].split('; ', 2)[0]

    # Open the JSON data page using the cookie we just obtained
    p2 = URI.open('https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ', 'Cookie' => cookie)

    # Get the raw JSON
    json = p2.read

    # Parse it
    data = JSON.parse(json)

    # Feed the HTML portion to Nokogiri
    doc = Nokogiri.parse(data['html'])

    # Extract the values
    values = doc.css('td.col2 span')
    puts values.map(&:text).inspect
    # => ["0.02%", "0.02%", "na", "-0.03%", "0.02%", "0.04%", "0.01%", "0.02%", "0.08%", "-0.01%", "0.03%", "0.01%", "0.05%", "0.04%"]
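To see what the cookie and JSON steps do without touching the network, here is a small offline sketch. The `Set-Cookie` value and the JSON body are made up for illustration; only the `'html'` key mirrors the code above.

```ruby
require 'json'

# Hypothetical Set-Cookie header value, made up for illustration.
set_cookie = 'SESSIONID=abc123; path=/; HttpOnly'

# Keep only the "name=value" pair; attributes such as path are not sent back.
cookie = set_cookie.split('; ', 2)[0]

# Hypothetical response body; the real payload may carry more keys.
json = '{"html": "<table><tr><td class=\"col2\"><span>0.02%</span></td></tr></table>"}'

data = JSON.parse(json)
html = data.fetch('html') # raises KeyError if the key is missing

puts cookie
puts html
```

Using `fetch` instead of `[]` is a small design choice: if the server changes its response format, the script fails with a clear error instead of handing `nil` to Nokogiri.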
+3

You can use Capybara with the Poltergeist driver to execute the JavaScript and render the page. Poltergeist is a wrapper for the headless browser PhantomJS. Here is an example of how you can do this:

    require 'rubygems'
    require 'nokogiri'
    require 'capybara'
    require 'capybara/dsl'
    require 'capybara/poltergeist'

    Capybara.default_driver = :poltergeist
    Capybara.run_server = false

    module GetPrice
      class WebScraper
        include Capybara::DSL

        # Load the page in the headless browser, then hand the rendered HTML to Nokogiri
        def get_page_data(url)
          visit(url)
          doc = Nokogiri::HTML(page.html)
          doc.css('td.col2 span')
        end
      end
    end

    scraper = GetPrice::WebScraper.new
    puts scraper.get_page_data('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj').map(&:text).inspect

Visit here for a complete example using Amazon.com: https://github.com/wakproductions/amazon_get_price/blob/master/getprice.rb

+15

PhantomJS is a headless browser with a JavaScript API. Since you need to run the scripts on the page you are scraping, a browser will do that for you, and PhantomJS lets you manipulate and scrape the page after the scripts have run.

+2

Source: https://habr.com/ru/post/1445803/

