Web scraping an IIS-based website

Question

I am using R to scrape tables from this site.

I am using the rvest library.

 # install.packages("rvest", dependencies = TRUE)
 library(rvest)
 OPMpage <- read_html("https://www.opm.gov/policy-data-oversight/data-analysis-documentation/federal-employment-reports/historical-tables/total-government-employment-since-1962/")

I get this error:

 Error in open.connection(x, "rb") : HTTP error 403.

What am I doing wrong?

+5

r web-scraping

Feyzi bagirov Feb 29 '16 at 2:18

2 answers

alistaire · Answer 1 · 2016-02-29T04:25:36+0000

The server prevents you from accessing the page because the user-agent string in your request headers is NULL. (The user-agent is normally a string identifying the browser you are using, although some browsers let users spoof other browsers.) Using the httr package, you can set the user-agent:

 library(httr)
 library(rvest)
 url <- "https://www.opm.gov/policy-data-oversight/data-analysis-documentation/federal-employment-reports/historical-tables/total-government-employment-since-1962/"
 x <- GET(url, add_headers('user-agent' = 'Gov employment data scraper ([[your email]])'))

Inside the GET call, add_headers lets you set whatever headers you like. You can also use the more specific user_agent function instead of add_headers, if that is all you want to set.
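As a rough sketch of the user_agent alternative mentioned above: the snippet below builds the same request with httr's user_agent() helper. The wrapper name fetch_opm is made up for illustration, and the agent text is the same placeholder as before, so substitute your own contact details.

```r
library(httr)

url <- "https://www.opm.gov/policy-data-oversight/data-analysis-documentation/federal-employment-reports/historical-tables/total-government-employment-since-1962/"

# user_agent() is httr's shorthand for setting only the User-Agent header.
ua <- user_agent("Gov employment data scraper ([[your email]])")

# fetch_opm is a hypothetical wrapper, used here so that no request is
# sent until you call it yourself.
fetch_opm <- function() {
  GET(url, ua)
}
```

Calling x <- fetch_opm() should then give a response object that can be passed on to rvest exactly like the result of the add_headers version.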

In this case, any user-agent will work, but it is polite (see the link at the end) to say who you are and what you want.

Now you can use rvest to parse the HTML and pull out the table. You need a way to select the right table; looking at the HTML, I saw that it had class = "DataTable", but you can also use SelectorGadget (see the rvest vignettes) to find a valid CSS or XPath selector. Thus,

 x %>% read_html() %>% html_node('.DataTable') %>% html_table()

gives you a nice (if not completely clean) data.frame.
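To try the selector step without touching the network, here is a minimal sketch that runs the same pipeline on an inline HTML string. The markup and its values are invented stand-ins, not data from the OPM page; only the class name DataTable comes from the answer above.

```r
library(rvest)

# Made-up stand-in for the real page: two tables, only one of which
# carries the class "DataTable". The numbers are dummy values.
html <- '<html><body>
  <table class="Other"><tr><th>x</th></tr><tr><td>0</td></tr></table>
  <table class="DataTable">
    <tr><th>Year</th><th>Total</th></tr>
    <tr><td>1</td><td>10</td></tr>
    <tr><td>2</td><td>20</td></tr>
  </table>
</body></html>'

tbl <- read_html(html) %>%
  html_node(".DataTable") %>%  # first element matching the CSS class
  html_table()                 # convert that <table> to a data frame

print(tbl)
```

The class selector skips the first table entirely, which is why picking a specific selector matters on pages with more than one table.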

Note: Scrape responsibly and legally. Given that OPM is a public source, its data is in the public domain, but that is not the case for much of the Internet. Always read any terms of service, plus this nice post on how to scrape responsibly.

Hack-r · Answer 2 · 2016-02-29T02:52:30+0000

Your format for read_html or html is correct:

 library(rvest)
 lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
 lego_movie <- html("http://www.imdb.com/title/tt1490017/")

But you get a 403 because either the page, or the part of the page you are trying to scrape, does not allow scraping.
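One way to see what the server actually returned, rather than letting read_html fail, is to make the request with httr and inspect the status. This is only a sketch: fetch_checked and its default agent string are hypothetical, not part of httr or rvest.

```r
library(httr)

# Hypothetical helper: request a URL with an explicit user-agent and
# return the response only if the server did not answer with an error.
fetch_checked <- function(url,
                          agent = "example scraper (contact: you@example.com)") {
  resp <- GET(url, user_agent(agent))
  if (http_error(resp)) {
    # http_status() turns the numeric code into a readable message
    stop("Request failed: ", http_status(resp)$message, call. = FALSE)
  }
  resp
}
```

If the call stops with a 403 even when a user-agent is supplied, that points to the server refusing the request for some other reason.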

You might need to see vignette("selectorgadget") and use selectorgadget in combination with rvest:

http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/

But most likely, this is simply not a page that is meant to be scraped. However, I believe that Barack Obama and the new Chief Data Scientist of the United States, DJ Patil, recently rolled out a central hub for obtaining this type of US government data for easy import.
