Using rvest, how to extract HTML content from the object returned by submit_form()

Question

I am trying to download traffic data from pems.dot.ca.gov by following this section.

    rm(list=ls())
    library(rvest)
    library(xml2)
    library(httr)

    url <- "http://pems.dot.ca.gov/?report_form=1&dnode=tmgs&content=tmg_volumes&tab=tmg_vol_ts&export=&tmg_station_id=74250&s_time_id=1369094400&s_time_id_f=05%2F21%2F2013&e_time_id=1371772740&e_time_id_f=06%2F20%2F2013&tod=all&tod_from=0&tod_to=0&dow_5=on&dow_6=on&tmg_sub_id=all&q=obs_flow&gn=hour&html.x=34&html.y=8"

    pgsession <- html_session(url)
    pgform <- html_form(pgsession)[[1]]
    filled_form <- set_values(pgform,
                              'username' = 'omitted',
                              'password' = 'omitted')

    resp = submit_form(pgsession, filled_form)
    resp_2 = resp$response
    cont = resp_2$content

I checked the class() of these elements and found that resp is a "session", resp_2 is a "response", and cont is "raw". My question is: how can I extract the HTML content correctly so that I can continue with XPath to select the actual data I want from this page? My intuition is that I have to parse resp_2, which is the response, but I just can't get it to work. Your help is greatly appreciated!

html r html-parsing web-scraping rvest

user3768495 Jul 31 '16 at 18:19

2 answers

You need httr::content, which parses the body of the response. In this case the body is HTML, so the result is an xml_document that rvest can work with directly:

    resp_2 %>% content()
    ## {xml_document}
    ## <html style="height: 100%">
    ## [1] <head>\n <!-- public -->\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/
    ## ...
    ## [2] <body class="yui-skin-sam public">\n <div id="maincontainer" style="height: 100%">\n\n \n\
    ## ...
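Since the question also mentions that cont is a raw vector, here is a minimal offline sketch of parsing raw bytes directly with xml2::read_html, which accepts a raw vector. The HTML string is a made-up stand-in for resp$response$content (the real page needs a login), and the names cont, doc and cells are only illustrative:

```r
library(xml2)

# Stand-in for resp$response$content: a raw vector holding HTML bytes.
# (Hypothetical markup; the real PeMS page is larger and needs a login.)
cont <- charToRaw(
  "<html><body><table class='inlayTable'><tr><td>Hour</td><td>8 Lanes</td></tr></table></body></html>"
)

# read_html() accepts a raw vector and returns an xml_document.
doc <- read_html(cont)

# XPath queries now work on the parsed document.
cells <- xml_text(xml_find_all(doc, "//table[@class='inlayTable']//td"))
print(cells)
```

With a live response, content(resp_2) should give the same kind of object without touching the raw bytes yourself.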

alistaire Jul 31 '16 at 23:18

hrbrmstr · Accepted Answer · 2016-07-31T23:57:17+0000

This should do it:

    pg <- content(resp$response)

    html_nodes(pg, "table.inlayTable") %>%
      html_table() -> tab

    head(tab[[1]])
    ##                 X1      X2           X3           X4
    ## 1                          Data Quality Data Quality
    ## 2             Hour 8 Lanes   % Observed  % Estimated
    ## 3 05/24/2013 00:00   1,311           50            0
    ## 4 05/24/2013 01:00     729           50            0
    ## 5 05/24/2013 02:00     399           50            0
    ## 6 05/24/2013 03:00     487           50            0

(You will obviously need to change the column names, since the first two rows of the table are header rows.)
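One possible way to tidy the table, as a rough base-R sketch. The data frame raw_tab is rebuilt by hand from the first printed rows above so the example runs offline; the new column names and the UTC time zone are assumptions, not something the page or the answer specifies:

```r
# Stand-in for tab[[1]], rebuilt from the printed output above.
raw_tab <- data.frame(
  X1 = c("", "Hour", "05/24/2013 00:00", "05/24/2013 01:00"),
  X2 = c("", "8 Lanes", "1,311", "729"),
  X3 = c("Data Quality", "% Observed", "50", "50"),
  X4 = c("Data Quality", "% Estimated", "0", "0"),
  stringsAsFactors = FALSE
)

# Drop the two header rows and assign readable names (names are assumed).
traffic <- raw_tab[-(1:2), ]
names(traffic) <- c("hour", "flow_8_lanes", "pct_observed", "pct_estimated")
rownames(traffic) <- NULL

# Convert text columns to proper types; "1,311" needs its comma removed.
# The time zone is an assumption; adjust it to match the data source.
traffic$hour <- as.POSIXct(traffic$hour, format = "%m/%d/%Y %H:%M", tz = "UTC")
traffic$flow_8_lanes <- as.numeric(gsub(",", "", traffic$flow_8_lanes))
traffic$pct_observed <- as.numeric(traffic$pct_observed)
traffic$pct_estimated <- as.numeric(traffic$pct_estimated)

str(traffic)
```

On the real result, replace raw_tab with tab[[1]]; the rest should carry over as long as the table keeps the same two header rows.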
