Here's an alternative approach with rvest and dplyr:
    library(rvest)
    library(dplyr)
    library(pbapply)

    base_url <- "http://www.progarchives.com/album.asp?id=%s"

    # Fetch one album page and pull out title, date and band
    get_album_info <- function(id) {

      # read_html() replaces the deprecated html() in current rvest
      pg <- read_html(sprintf(base_url, id))

      data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
                 date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
                 band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
                 stringsAsFactors=FALSE)

    }

    albums <- bind_rows(pblapply(2:10, get_album_info))

    head(albums)
I didn't want to hammer the site with tons of requests, so increase the sequence (2:10) for your own use. pblapply gives you a free progress bar.
To be nice to the site (esp. since it explicitly forbids scraping), you might want to add a Sys.sleep(10) inside the get_album_info function, just before it returns its result.
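One detail worth watching when adding the pause: Sys.sleep() returns NULL, so if it is literally the last expression in a function, the function returns NULL instead of the data frame. Below is a minimal sketch of one way around that, using a wrapper that saves the result, sleeps, then returns it. The names with_delay and fake_album_info are hypothetical and not part of the answer above; fake_album_info only stands in for get_album_info so the sketch runs without hitting the site.

```r
# Hypothetical helper: wrap any fetch function so each call pauses afterwards.
with_delay <- function(f, seconds = 10) {
  function(...) {
    result <- f(...)     # do the work first
    Sys.sleep(seconds)   # then pause; Sys.sleep() itself returns NULL
    result               # so hand back the saved result explicitly
  }
}

# Stand-in for get_album_info() so this runs without network access.
fake_album_info <- function(id) {
  data.frame(album = paste("album", id), stringsAsFactors = FALSE)
}

# Short delay here only to keep the demo quick; use something like 10 for real.
polite_fake <- with_delay(fake_album_info, seconds = 0.1)
albums <- do.call(rbind, lapply(1:3, polite_fake))
```

With the real function, the same idea would presumably look like pblapply(2:10, with_delay(get_album_info, 10)).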
UPDATE
To handle server errors (in this case 500, but it will work for others too), you can use try:
    library(rvest)
    library(dplyr)
    library(pbapply)
    library(data.table)

    base_url <- "http://www.progarchives.com/album.asp?id=%s"

    get_album_info <- function(id) {

      # read_html() replaces the deprecated html() in current rvest
      pg <- try(read_html(sprintf(base_url, id)), silent=TRUE)

      if (inherits(pg, "try-error")) {
        # failed request: return an empty frame with the same columns
        data.frame(album=character(0), date=character(0), band=character(0),
                   stringsAsFactors=FALSE)
      } else {
        data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
                   date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
                   band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
                   stringsAsFactors=FALSE)
      }

    }

    albums <- rbindlist(pblapply(c(9:10, 23, 28, 29, 30), get_album_info))
You won't get any entries for the pages that error out (in this case, it just returns entries for ids 9, 10 and 30).
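To see why the failed ids simply drop out, here is a small offline sketch of the same pattern. It swaps in tryCatch for try and base rbind for rbindlist, purely so it runs without extra packages or network access. fake_fetch and safe_fetch are hypothetical names, the set of failing ids is made up for the demo, and the id column is an addition that makes it easy to check which pages came back.

```r
# Hypothetical stand-in for the page fetch: errors for some ids, like a 500.
fake_fetch <- function(id) {
  if (id %in% c(23, 28, 29)) stop("HTTP 500")
  data.frame(id = id, album = paste("album", id), stringsAsFactors = FALSE)
}

# On error, return a zero-row frame with the same columns so binding still works.
safe_fetch <- function(id) {
  tryCatch(
    fake_fetch(id),
    error = function(e) {
      data.frame(id = numeric(0), album = character(0), stringsAsFactors = FALSE)
    }
  )
}

ids <- c(9:10, 23, 28, 29, 30)
albums <- do.call(rbind, lapply(ids, safe_fetch))

# Zero-row frames contribute nothing, so only the successful ids remain.
albums$id
```

Keeping the id in the result also lets you work out which pages failed afterwards, e.g. with setdiff(ids, albums$id).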