prevents you from accessing the page because you have a NULL user-agent string in your headers. (This is usually a string indicating which browser you are using, although some browsers let users spoof other browsers.) Using the httr package, you can set the user-agent:
library(httr)
library(rvest)

url <- "https://www.opm.gov/policy-data-oversight/data-analysis-documentation/federal-employment-reports/historical-tables/total-government-employment-since-1962/"
x <- GET(url, add_headers('user-agent' = 'Gov employment data scraper ([[your email]])'))
Passed to the GET request, add_headers lets you set whatever headers you like. You can also use the more specific user_agent function instead of add_headers, if that's all you want to set.
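As a minimal sketch of the user_agent alternative: the helper name fetch_page and the contact address below are hypothetical, chosen only for illustration. httr's user_agent() builds a request config that can be passed to GET in place of add_headers.

```r
library(httr)

# Hypothetical contact string; replace it with your own details.
ua <- user_agent("Gov employment data scraper (you@example.com)")

# Hypothetical helper: fetch a page while identifying ourselves.
fetch_page <- function(url, agent = ua) {
  resp <- GET(url, agent)
  # Turn HTTP errors (4xx/5xx) into R errors instead of failing silently.
  stop_for_status(resp)
  resp
}
```

Calling fetch_page(url) with the url defined earlier should then behave like the add_headers version, with the added check that the request actually succeeded.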
In this case, any user-agent will work, but it is polite (see the link at the end) to say who you are and what you want.
Now you can use rvest to parse the HTML and pull out the table. You need a way to select the relevant table; looking at the HTML, I saw that it had class = "DataTable", but you can also use SelectorGadget (see the rvest vignettes) to find a suitable CSS or XPath selector. Thus,
x %>% read_html() %>% html_node('.DataTable') %>% html_table()
gives you a nice (if not completely clean) data.frame.
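To see the selector step in isolation, here is a small offline sketch. The HTML string is made up to stand in for the real page, and its values are placeholders, not OPM figures. It selects the table by class and then does one typical cleanup, stripping thousands separators so the column becomes numeric.

```r
library(rvest)

# Made-up HTML standing in for the real page; values are placeholders.
html <- '<html><body>
  <table class="Other"><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <table class="DataTable">
    <tr><th>Year</th><th>Total</th></tr>
    <tr><td>2000</td><td>1,234</td></tr>
    <tr><td>2001</td><td>5,678</td></tr>
  </table>
</body></html>'

# Pick the table with class "DataTable" and convert it to a data frame.
df <- read_html(html) %>%
  html_node(".DataTable") %>%
  html_table()

# Numbers with commas come back as text; remove the commas and convert.
df$Total <- as.numeric(gsub(",", "", df$Total, fixed = TRUE))
```

The same cleanup idea would apply to the scraped table, although its actual column names and quirks need to be checked by inspecting the result first.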
Note: Scrape responsibly and legally. Given that OPM is a public source, it is in the public domain, but this is not the case with much of the Internet. Always read any terms of service, plus this nice post on how to scrape responsibly.