This seems to work.
    library(XML)
    library(httr)

    # Collect the painting links from each of the 23 index pages
    url   <- "http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet/"
    hrefs <- list()
    for (i in 1:23) {
      response <- GET(paste0(url, i))
      doc      <- content(response, type = "text/html")
      hrefs    <- c(hrefs, doc["//p[@class='pb5']/a/@href"])
    }

    # One xPath per field to extract from each painting page
    url   <- "http://www.wikiart.org"
    xPath <- c(pictureName = "//h1[@itemprop='name']",
               date        = "//span[@itemprop='dateCreated']",
               author      = "//a[@itemprop='author']",
               style       = "//span[@itemprop='style']",
               genre       = "//span[@itemprop='genre']")

    # Fetch one painting page; a field whose node is missing becomes NA
    get.picture <- function(href) {
      response <- GET(paste0(url, href))
      doc      <- content(response, type = "text/html")
      sapply(xPath, function(xp) ifelse(length(doc[xp]) == 0, NA, xmlValue(doc[xp][[1]])))
    }

    pictures <- do.call(rbind, lapply(hrefs, get.picture))
    head(pictures)
    #      pictureName                           date     author         style           genre
    # [1,] "A Corner of the Garden at Montgeron" "1877"   "Claude Monet" "Impressionism" "landscape"
    # [2,] "A Corner of the Studio"              "1861"   "Claude Monet" "Realism"       "self-portrait"
    # [3,] "A Farmyard in Normandy"              "c.1863" "Claude Monet" "Realism"       "landscape"
    # [4,] "A Windmill near Zaandam"             NA       "Claude Monet" "Impressionism" "landscape"
    # [5,] "A Woman Reading"                     "1872"   "Claude Monet" "Impressionism" "genre painting"
    # [6,] "Adolphe Monet Reading in the Garden" "1866"   "Claude Monet" "Impressionism" "genre painting"
You were very close. Your xPath is fine; one of the problems is that not every painting page has all the information (on some pages the nodes you are trying to extract are missing). Look at the date for "A Windmill near Zaandam" in the output above, which is NA. The code therefore has to handle missing nodes.
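The missing-node case can be reproduced offline. The following is only a minimal sketch: the HTML fragment is made up for illustration, `first_value` is a hypothetical helper (not part of the answer's code), and it uses `getNodeSet()` from the XML package rather than the `doc[xpath]` shorthand.

```r
library(XML)

# Made-up page fragment: it has a title node but no dateCreated node.
html <- "<html><body><h1 itemprop='name'>A Windmill near Zaandam</h1></body></html>"
doc  <- htmlParse(html, asText = TRUE)

# Hypothetical helper: text of the first matching node, or NA if none match.
first_value <- function(doc, xp) {
  nodes <- getNodeSet(doc, xp)
  if (length(nodes) == 0) NA_character_ else xmlValue(nodes[[1]])
}

name <- first_value(doc, "//h1[@itemprop='name']")
date <- first_value(doc, "//span[@itemprop='dateCreated']")
```

Here `name` holds the title, while `date` is NA because the xPath matches nothing.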
So, in this example, the first loop captures the values of the href attribute of the anchor tags on each index page (1:23) and combines them into a list of about 1300 links.
To process each of these 1300 pages, and since we have to deal with missing tags, it's easier to create a named vector of xPath strings and apply each one to every page. That is what the get.picture(...) function does. The last statement calls this function with each of the 1300 hrefs and binds the results together row by row using do.call(rbind,...).
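The row-binding step can be tried without any network access. This is just a small illustration in base R: the two records below are hand-written stand-ins for what a per-page function would return, not scraped data.

```r
# Hand-written stand-ins for the named vectors returned per page.
records <- list(
  c(pictureName = "A Woman Reading",         date = "1872"),
  c(pictureName = "A Windmill near Zaandam", date = NA)
)

# Each named vector becomes one row; the vector names become column names.
pictures <- do.call(rbind, records)
```

The result is a character matrix with one row per record, which is why missing fields have to be NA rather than dropped: every row needs the same number of columns.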
Note that this code uses a slightly more compact indexing form for objects of the HTMLInternalDocument class: doc[xpath], where xpath is the xPath string. This avoids the use of xpathSApply(...), although the latter might work.
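For comparison, here is a rough sketch of the `xpathSApply()` route on a made-up fragment. The two href values are invented for illustration, and selecting the `a` nodes and then reading the attribute with `xmlGetAttr()` is one possible way to do it, not necessarily what the original code is equivalent to in every detail.

```r
library(XML)

# Made-up index-page fragment with two painting links.
html <- paste0(
  "<html><body>",
  "<p class='pb5'><a href='/en/claude-monet/a-woman-reading'>A Woman Reading</a></p>",
  "<p class='pb5'><a href='/en/claude-monet/a-corner-of-the-studio'>A Corner of the Studio</a></p>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Select the anchor nodes, then pull the href attribute from each one.
hrefs <- xpathSApply(doc, "//p[@class='pb5']/a", xmlGetAttr, "href")
```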