Scraping an .asp site with R

I am scraping http://www.progarchives.com/album.asp?id= and get a warning message:

Warning message:

    XML content does not seem to be XML:
    http://www.progarchives.com/album.asp?id=2
    http://www.progarchives.com/album.asp?id=3
    http://www.progarchives.com/album.asp?id=4
    http://www.progarchives.com/album.asp?id=5

The scraper works for each page separately, but not for the whole range b1=2, b2=1000.

    library(RCurl)
    library(XML)

    getUrls <- function(b1, b2) {
      root <- "http://www.progarchives.com/album.asp?id="
      urls <- NULL
      for (bandid in b1:b2) {
        urls <- c(urls, paste(root, bandid, sep = ""))
      }
      return(urls)
    }

    prog.arch.scraper <- function(url) {
      SOURCE <- getUrls(b1 = 2, b2 = 1000)
      PARSED <- htmlParse(SOURCE)
      album <- xpathSApply(PARSED, "//h1[1]", xmlValue)
      date <- xpathSApply(PARSED, "//strong[1]", xmlValue)
      band <- xpathSApply(PARSED, "//h2[1]", xmlValue)
      return(c(band, album, date))
    }

    prog.arch.scraper(urls)
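The warning happens because htmlParse() parses one document at a time; handed the whole vector, it treats the input as literal XML content rather than a list of URLs, which is exactly what the warning text says. A minimal per-page sketch (scrape_album is a hypothetical helper name, reusing getUrls() from above):

    library(RCurl)
    library(XML)

    # sketch: fetch and parse each page in its own iteration instead of
    # passing the entire vector of URLs to htmlParse() at once
    scrape_album <- function(u) {
      PARSED <- htmlParse(getURL(u))
      c(band  = xpathSApply(PARSED, "//h2[1]", xmlValue),
        album = xpathSApply(PARSED, "//h1[1]", xmlValue),
        date  = xpathSApply(PARSED, "//strong[1]", xmlValue))
    }

    # do.call(rbind, lapply(getUrls(2, 10), scrape_album))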
2 answers

Here's an alternative approach with rvest and dplyr:

    library(rvest)
    library(dplyr)
    library(pbapply)

    base_url <- "http://www.progarchives.com/album.asp?id=%s"

    get_album_info <- function(id) {
      pg <- html(sprintf(base_url, id))
      data.frame(album = pg %>% html_nodes(xpath = "//h1[1]") %>% html_text(),
                 date  = pg %>% html_nodes(xpath = "//strong[1]") %>% html_text(),
                 band  = pg %>% html_nodes(xpath = "//h2[1]") %>% html_text(),
                 stringsAsFactors = FALSE)
    }

    albums <- bind_rows(pblapply(2:10, get_album_info))

    head(albums)
    ## Source: local data frame [6 x 3]
    ##
    ##                          album                           date      band
    ## 1                      FOXTROT Studio Album, released in 1972   Genesis
    ## 2                NURSERY CRYME Studio Album, released in 1971   Genesis
    ## 3                 GENESIS LIVE         Live, released in 1973   Genesis
    ## 4          A TRICK OF THE TAIL Studio Album, released in 1976   Genesis
    ## 5   FROM GENESIS TO REVELATION Studio Album, released in 1969   Genesis
    ## 6             GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz

I didn't want to hammer the site with tons of requests, so increase the 2:10 sequence for your own use. pblapply gives you a free progress bar.

To be nice to the site (especially since it explicitly forbids scraping), you might want to put a Sys.sleep(10) at the end of the get_album_info function.
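That change could look like this (a minimal sketch of the same function with the delay added at the end; 10 seconds is just the value suggested above):

    get_album_info <- function(id) {
      pg <- html(sprintf(base_url, id))
      res <- data.frame(album = pg %>% html_nodes(xpath = "//h1[1]") %>% html_text(),
                        date  = pg %>% html_nodes(xpath = "//strong[1]") %>% html_text(),
                        band  = pg %>% html_nodes(xpath = "//h2[1]") %>% html_text(),
                        stringsAsFactors = FALSE)
      Sys.sleep(10)  # pause between requests so we don't hammer the server
      res
    }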

UPDATE

To handle server errors (in this case 500, but it will work for others too), you can use try:

    library(rvest)
    library(dplyr)
    library(pbapply)
    library(data.table)

    base_url <- "http://www.progarchives.com/album.asp?id=%s"

    get_album_info <- function(id) {
      pg <- try(html(sprintf(base_url, id)), silent = TRUE)
      if (inherits(pg, "try-error")) {
        data.frame(album = character(0), date = character(0), band = character(0))
      } else {
        data.frame(album = pg %>% html_nodes(xpath = "//h1[1]") %>% html_text(),
                   date  = pg %>% html_nodes(xpath = "//strong[1]") %>% html_text(),
                   band  = pg %>% html_nodes(xpath = "//h2[1]") %>% html_text(),
                   stringsAsFactors = FALSE)
      }
    }

    albums <- rbindlist(pblapply(c(9:10, 23, 28, 29, 30), get_album_info))

    ##                       album                           date         band
    ## 1: THE DANGERS OF STRANGERS Studio Album, released in 1988    Abel Ganz
    ## 2:   THE DEAFENING SILENCE Studio Album, released in 1994    Abel Ganz
    ## 3:            AD INFINITUM Studio Album, released in 1998 Ad Infinitum

You won't get any entries for the error pages (in this case, only ids 9, 10 and 30 return data).
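An equivalent formulation with tryCatch(), if you prefer an explicit error handler over inspecting the try-error class (a sketch, mirroring the function above and reusing the same base_url):

    get_album_info <- function(id) {
      tryCatch({
        pg <- html(sprintf(base_url, id))
        data.frame(album = pg %>% html_nodes(xpath = "//h1[1]") %>% html_text(),
                   date  = pg %>% html_nodes(xpath = "//strong[1]") %>% html_text(),
                   band  = pg %>% html_nodes(xpath = "//h2[1]") %>% html_text(),
                   stringsAsFactors = FALSE)
      }, error = function(e) {
        # on any HTTP or parse error, return a zero-row frame so rbindlist skips it
        data.frame(album = character(0), date = character(0), band = character(0))
      })
    }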


Instead of xpathApply(), you can subset the first node in the node set of each path and call xmlValue() on it. Here is what I came up with:

    library(XML)
    library(RCurl)

    ## define the urls and xpath queries
    urls <- sprintf("http://www.progarchives.com/album.asp?id=%s", 2:10)
    path <- c(album = "//h1", date = "//strong", band = "//h2")

    ## define a re-usable curl handle for the C-level nodes
    curl <- getCurlHandle()

    ## allocate the result list
    out <- vector("list", length(urls))

    ## do the work
    for (u in urls) {
      content <- getURL(u, curl = curl)
      doc <- htmlParse(content, useInternalNodes = TRUE)
      out[[u]] <- lapply(path, function(x) xmlValue(doc[x][[1]]))
      free(doc)
    }

    ## structure the result
    data.table::rbindlist(out)
    #                          album                           date      band
    # 1:                     FOXTROT Studio Album, released in 1972   Genesis
    # 2:               NURSERY CRYME Studio Album, released in 1971   Genesis
    # 3:                GENESIS LIVE         Live, released in 1973   Genesis
    # 4:         A TRICK OF THE TAIL Studio Album, released in 1976   Genesis
    # 5:  FROM GENESIS TO REVELATION Studio Album, released in 1969   Genesis
    # 6:            GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz
    # 7:           GULLIBLES TRAVELS Studio Album, released in 1985 Abel Ganz
    # 8:    THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
    # 9:      THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz
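As a side note (not part of the original answer): the doc[x][[1]] idiom works because the XML package provides a [ method for parsed documents that evaluates its argument as an XPath query, so on a parsed doc these two expressions should return the same first node:

    # two equivalent ways to grab the first matching node from a parsed document
    doc["//h1"][[1]]
    getNodeSet(doc, "//h1")[[1]]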

Update: For handling requests where the id does not exist, we can write a condition with RCurl::url.exists() that handles the bad ones. So the following getAlbums() function returns a table whose rows contain either the scraped values or NAs, depending on the status of the URL. You can change that if you want, of course. It was just the method that came to mind in the wee hours.

    getAlbums <- function(url, id = numeric(), xPath = list()) {
      urls <- sprintf("%s?id=%d", url, id)
      curl <- getCurlHandle()
      out <- vector("list", length(urls))
      for (u in urls) {
        out[[u]] <- if (url.exists(u)) {
          content <- getURL(u, curl = curl)
          doc <- htmlParse(content, useInternalNodes = TRUE)
          lapply(xPath, function(x) xmlValue(doc[x][[1]]))
        } else {
          warning(sprintf("returning 'NA' for urls[%d]", id[urls == u]))
          structure(as.list(xPath[NA]), names = names(xPath))
        }
        if (exists("doc")) free(doc)
      }
      data.table::rbindlist(out)
    }

    url <- "http://www.progarchives.com/album.asp"
    id <- c(9:10, 23, 28, 29, 30)
    path <- c(album = "//h1", date = "//strong", band = "//h2")
    getAlbums(url, id, path)
    #                       album                           date         band
    # 1: THE DANGERS OF STRANGERS Studio Album, released in 1988    Abel Ganz
    # 2:   THE DEAFENING SILENCE Studio Album, released in 1994    Abel Ganz
    # 3:                      NA                             NA           NA
    # 4:                      NA                             NA           NA
    # 5:                      NA                             NA           NA
    # 6:            AD INFINITUM Studio Album, released in 1998 Ad Infinitum
    #
    # Warning messages:
    # 1: In getAlbums(url, id, path) : returning 'NA' for urls[23]
    # 2: In getAlbums(url, id, path) : returning 'NA' for urls[28]
    # 3: In getAlbums(url, id, path) : returning 'NA' for urls[29]
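If you would rather drop the missing ids up front instead of carrying NA rows, a small pre-filter along these lines should also work (a sketch, reusing the url, id and path objects above):

    ## check each URL once and scrape only the ids that exist
    ok <- vapply(sprintf("%s?id=%d", url, id), url.exists, logical(1))
    getAlbums(url, id[ok], path)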

Source: https://habr.com/ru/post/983519/

