It is important to understand that when you scrape a web page, you get the HTML source for that page, which is not necessarily what you see in a web browser. When you call GET(url), you get the actual HTML/text source of the page, exactly as it is sent from the server. These days, most web pages assume the browser will not only render the HTML but also execute the JavaScript on the page. This matters most when the bulk of the page's content is generated by JavaScript after the initial load, which is exactly what is happening here: the "content" of the page is not in the HTML source; it is loaded later via JavaScript.
Neither httr nor RCurl will execute the JavaScript needed to populate the page with the table you are actually seeing. There is a package called RSelenium that can drive a real browser and run JavaScript, but in this case we can get around that entirely.
First, a side note on why getURL did not work. This web server appears to inspect the user-agent header sent by the requesting program and vary its response accordingly. Whatever user agent RCurl sends by default is apparently not deemed worthy of receiving HTML from this server. You can work around this by specifying a different user agent. For instance,
d2 <- getURL(url, .opts=list(useragent="Mozilla/5.0"))
seems to work.
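The same workaround is available in httr: its user_agent() helper sets the User-Agent header for a single request. A minimal sketch (the exact user-agent string is arbitrary; the server just needs to see something browser-like):

```r
library(httr)

# Equivalent workaround with httr: user_agent() attaches a User-Agent
# header to this one request.
d2 <- GET(url, user_agent("Mozilla/5.0"))
```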
But back to the main problem. When debugging issues like this, I highly recommend the Chrome Developer Tools (or the equivalent in your favorite browser). In Chrome's developer tools, on the Network tab in particular, you can see every request Chrome makes to fetch the page's data.

If you click on the first request ("etfs.html"), you will see the headers and the response for that request. On the Response tab, you should see exactly the same content that GET or getURL returned. After that, the browser loads a bunch of CSS and JavaScript files. The file that looked most interesting was "GetETFJson.js". It appears to contain most of the page's data in an almost-JSON format; in fact there is a bit of actual JavaScript in front of the JSON block that gets in the way. We can download this file with
d3 <- GET("https://www.vanguardcanada.ca/individual/mvc/GetETFJson.js")
and extract the contents as text with
p3 <- content(d3, as="text")
and then turn it into an R object using
library(jsonlite)
r3 <- fromJSON(substr(p3, 13, nchar(p3)))
Again, we use substr to drop the non-JSON prefix so the rest parses cleanly.
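Hard-coding the offset 13 is fragile if the JavaScript prefix ever changes length. A slightly more robust sketch is to find the first "{" and parse from there (this assumes the payload is a single JSON object preceded by arbitrary JavaScript, with nothing after it, as in GetETFJson.js):

```r
library(jsonlite)

# Strip everything before the first "{"; assumes the JSON object runs
# to the end of the file, as it does in GetETFJson.js.
strip_js_prefix <- function(txt) {
  start <- regexpr("{", txt, fixed = TRUE)
  substr(txt, start, nchar(txt))
}

r3 <- fromJSON(strip_js_prefix(p3))
```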
Now you can explore the returned object. It looks like the data you want is stored in the following vectors:
cbind(r3$fundData$Fund$profile$portId, r3$fundData$Fund$profile$benchMark)
      [,1]   [,2]
 [1,] "9548" "FTSE All World ex Canada Index in CAD"
 [2,] "9561" "FTSE Canada All Cap Index in CAD"
 [3,] "9554" "Spliced Canada Index"
 [4,] "9559" "FTSE Canada All Cap Real Estate Capped 25% Index"
 [5,] "9560" "FTSE Canada High Dividend Yield Index"
 [6,] "9550" "FTSE Developed Asia Pacific Index in CAD"
 [7,] "9549" "FTSE Developed Europe Index in CAD"
 [8,] "9558" "FTSE Developed ex North America Index in CAD"
 [9,] "9555" "Spliced FTSE Developed ex North America Index Hedged in CAD"
[10,] "9556" "Spliced Emerging Markets Index in CAD"
[11,] "9563" "S&P 500 Index in CAD"
[12,] "9562" "S&P 500 Index in CAD Hedged"
[13,] "9566" "NASDAQ US Dividend Achievers Select Index in CAD"
[14,] "9564" "NASDAQ US Dividend Achievers Select Index Hedged in CAD"
[15,] "9557" "CRSP US Total Market Index in CAD"
[16,] "9551" "Spliced US Total Market Index Hedged in CAD"
[17,] "9552" "Barclays Global Aggregate CAD Float Adjusted Index in CAD"
[18,] "9553" "Barclays Global Aggregate CAD 1-5 Year Govt/Credit Float Adj Ix in CAD"
[19,] "9565" "Barclays Global Aggregate Canadian 1-5 Year Credit Float Adjusted Index in CAD"
[20,] "9568" "Barclays Global Aggregate ex-USD Float Adjusted RIC Capped Index Hedged in CAD"
[21,] "9567" "Barclays US Aggregate Float Adjusted Index Hedged in CAD"
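For downstream work it may be more convenient to collect these vectors into a data frame. A sketch (the column names here are my own choice, not taken from the JSON):

```r
# Collect the ids and benchmark names into a data frame for easier
# filtering and joining later.
etfs <- data.frame(
  portId    = r3$fundData$Fund$profile$portId,
  benchMark = r3$fundData$Fund$profile$benchMark,
  stringsAsFactors = FALSE
)
head(etfs)
```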
Hopefully this is enough to extract the data you need, or to work out which URL to request for additional data.