How to write code for web crawling and scraping in R

I am trying to write code that will go to each page and get information from there, starting from this URL:

    url <- "http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet"

I have code that should output all the hrefs, but it does not work.

    library(XML)
    library(RCurl)
    library(stringr)

    tagrecode <- readHTMLTable("http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet")
    tabla <- as.data.frame(tagrecode)
    str(tabla)
    names(tabla) <- c("name", "desc", "cat", "updated")
    str(tabla)

    res <- htmlParse("http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet")
    enlaces <- getNodeSet(res, "//p[@class='pb5']/a/@href")
    enlaces <- unlist(lapply(enlaces, as.character))
    tabla$enlace <- paste("http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet")
    str(tabla)

    lisurl <- tabla$enlace

    fu1 <- function(url) {
      print(url)
      pas1 <- htmlParse(url, useInternalNodes = T)
      pas2 <- xpathSApply(pas1, "//p[@class='pb5']/a/@href")
    }

    urldef <- lapply(lisurl, fu1)

After I have the list of URLs of all the paintings on this page, I want to go through the second, third, ..., 23rd pages to collect the URLs of all the images.
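As a rough, untested sketch of what I have in mind (assuming the listing pages simply end in /1 ... /23):

    library(XML)

    # Collect the painting links from each of the 23 listing pages.
    base <- "http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet/"
    all_hrefs <- character(0)
    for (i in 1:23) {
      page <- htmlParse(paste0(base, i))
      all_hrefs <- c(all_hrefs, xpathSApply(page, "//p[@class='pb5']/a/@href"))
    }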

The next step is to scrape the information about each image. I have working code for a single image, and I need to combine everything into one common script.

    library(XML)

    url = "http://www.wikiart.org/en/claude-monet/camille-and-jean-monet-in-the-garden-at-argenteuil"
    doc = htmlTreeParse(url, useInternalNodes = T)

    pictureName <- xpathSApply(doc, "//h1[@itemprop='name']", xmlValue)
    date <- xpathSApply(doc, "//span[@itemprop='dateCreated']", xmlValue)
    author <- xpathSApply(doc, "//a[@itemprop='author']", xmlValue)
    style <- xpathSApply(doc, "//span[@itemprop='style']", xmlValue)
    genre <- xpathSApply(doc, "//span[@itemprop='genre']", xmlValue)

    pictureName
    date
    author
    style
    genre
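My idea is to wrap this in a function and apply it over the whole list of picture URLs, roughly like the untested sketch below (all_hrefs stands for the vector of links collected above; pages with missing fields would probably need extra handling):

    library(XML)

    get_picture_info <- function(url) {
      doc <- htmlTreeParse(url, useInternalNodes = TRUE)
      data.frame(
        pictureName = xpathSApply(doc, "//h1[@itemprop='name']", xmlValue),
        date        = xpathSApply(doc, "//span[@itemprop='dateCreated']", xmlValue),
        author      = xpathSApply(doc, "//a[@itemprop='author']", xmlValue),
        style       = xpathSApply(doc, "//span[@itemprop='style']", xmlValue),
        genre       = xpathSApply(doc, "//span[@itemprop='genre']", xmlValue),
        stringsAsFactors = FALSE
      )
    }

    # Works for the single page above; applying it to every link would look like:
    # paintings <- do.call(rbind, lapply(all_hrefs, get_picture_info))
    get_picture_info(url)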

Any tips on how to do this would be appreciated!

2 answers

This seems to work.

    library(XML)
    library(httr)

    url <- "http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet/"
    hrefs <- list()
    for (i in 1:23) {
      response <- GET(paste0(url, i))
      doc <- content(response, type = "text/html")
      hrefs <- c(hrefs, doc["//p[@class='pb5']/a/@href"])
    }

    url <- "http://www.wikiart.org"
    xPath <- c(pictureName = "//h1[@itemprop='name']",
               date        = "//span[@itemprop='dateCreated']",
               author      = "//a[@itemprop='author']",
               style       = "//span[@itemprop='style']",
               genre       = "//span[@itemprop='genre']")

    get.picture <- function(href) {
      response <- GET(paste0(url, href))
      doc <- content(response, type = "text/html")
      info <- sapply(xPath, function(xp)
        ifelse(length(doc[xp]) == 0, NA, xmlValue(doc[xp][[1]])))
    }

    pictures <- do.call(rbind, lapply(hrefs, get.picture))
    head(pictures)
    #      pictureName                           date     author         style           genre
    # [1,] "A Corner of the Garden at Montgeron" "1877"   "Claude Monet" "Impressionism" "landscape"
    # [2,] "A Corner of the Studio"              "1861"   "Claude Monet" "Realism"       "self-portrait"
    # [3,] "A Farmyard in Normandy"              "c.1863" "Claude Monet" "Realism"       "landscape"
    # [4,] "A Windmill near Zaandam"             NA       "Claude Monet" "Impressionism" "landscape"
    # [5,] "A Woman Reading"                     "1872"   "Claude Monet" "Impressionism" "genre painting"
    # [6,] "Adolphe Monet Reading in the Garden" "1866"   "Claude Monet" "Impressionism" "genre painting"

You were very close. Your xPath expressions are fine; one of the problems is that not all of the image pages carry all of the information (for some pages, the nodes you are trying to extract are simply missing) - note the missing date for "A Windmill near Zaandam" above. The code therefore has to deal with this.
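Here is a minimal sketch of that guard in isolation, applied to a single page (the painting's URL slug is assumed here for illustration; the pattern simply mirrors the ifelse() check inside get.picture above):

    library(XML)

    # URL slug assumed for illustration; any painting page works.
    doc <- htmlParse("http://www.wikiart.org/en/claude-monet/a-windmill-near-zaandam")

    # doc[xpath] returns a (possibly empty) node set; guard before calling xmlValue().
    nodes <- doc["//span[@itemprop='dateCreated']"]
    date  <- if (length(nodes) == 0) NA else xmlValue(nodes[[1]])
    date   # expected to be NA for this painting, judging by the table above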

So, in this example, the first loop captures the href attribute values of the anchor tags on each page (1:23) and combines them into a vector of length ~1300.

To process each of these ~1300 pages while dealing with the missing tags, it is easier to create a vector of xPath strings and apply each of them to every page. That is what the get.picture(...) function does. The last statement calls this function for each of the ~1300 hrefs and binds the results together row by row using do.call(rbind, ...).
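Since sapply() returns a character vector per page, do.call(rbind, ...) yields a character matrix; if a data frame is more convenient for further work, a small optional follow-up (a sketch, not part of the code above) would be:

    # Convert the character matrix into a data frame for easier filtering later on.
    pictures.df <- as.data.frame(pictures, stringsAsFactors = FALSE)
    str(pictures.df)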

Note that this code uses a slightly more compact indexing method for objects of class HTMLInternalDocument: doc[xpath], where xpath is an xPath string. This avoids the use of xpathSApply(...), although the latter would also work.
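For illustration, both forms return the same node set; this is just a sketch of the equivalence on one of the painting pages from the question, assuming the XML package's [ indexing for parsed documents as used above:

    library(XML)

    doc <- htmlParse("http://www.wikiart.org/en/claude-monet/camille-and-jean-monet-in-the-garden-at-argenteuil")

    # Compact indexing on the parsed document: returns a node set.
    doc["//span[@itemprop='genre']"]

    # Equivalent calls from the XML package.
    getNodeSet(doc, "//span[@itemprop='genre']")
    xpathSApply(doc, "//span[@itemprop='genre']", xmlValue)  # extracts the text directly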


You can try the Rcrawler package; it is a parallel web crawler/scraper that can crawl and store web pages and extract their contents using XPath.

If you need information on all of the paintings, use:

    datapattern <- c(
      "//h1/span[@itemprop='name']",
      "//a[@class='artist-name']",
      "//*[@id='headSection']/article/form/div[1]/div/div/div[2]/div[2]/span[2]",
      "//*[@id='headSection']/article/form/div[1]/div/div/div[2]/div[3]/a/span",
      "//*[@id='headSection']/article/form/div[1]/div/div/div[2]/div[4]/a/span"
    )

    Rcrawler(Website = "https://www.wikiart.org/", no_cores = 4, no_conn = 4,
             ExtractPatterns = datapattern)

To filter only Claude Monet's paintings, use:

    Rcrawler(Website = "https://www.wikiart.org/", no_cores = 4, no_conn = 4,
             urlregexfilter = "claude-monet/([^/])*",
             ExtractPatterns = datapattern)

The crawler will take some time to complete, since it traverses all of the site's links. However, you can stop execution at any time. By default, the scraped data is stored in a global variable named DATA, and another global variable, INDEX, contains all of the crawled URLs.
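Once the crawl has finished (or been stopped), a quick sketch for inspecting those two globals - this only assumes the DATA and INDEX variables described above, not any particular column layout:

    # Number of pages scraped so far and a peek at the crawl index.
    length(DATA)
    head(INDEX)

    # Extracted fields for the first scraped page (structure depends on ExtractPatterns).
    str(DATA[[1]])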

If you want to learn how to build your own web crawler, see this article.


Source: https://habr.com/ru/post/971774/

