The following script reproduces the problems I run into when building a crawler with RCurl that performs parallel queries. The goal is to download the contents of several thousand websites for statistical analysis, so the solution must scale.
    library(RCurl)
    library(httr)

    uris = c("inforapido.com.ar", "lm.facebook.com", "promoswap.enterfactory.com",
             "p.brilig.com", "wap.renxo.com", "alamaula.com",
             "syndication.exoclick.com", "mcp-latam.zed.com", "startappexchange.com",
             "fonts.googleapis.com", "xnxx.com", "wv.inner-active.mobi",
             "canchallena.lanacion.com.ar", "android.ole.com.ar", "livefyre.com",
             "fbapp://256002347743983/thread")

    # Split uris into 3 groups
    uris_ls = split(uris, 1:3)

Here are examples of URLs that aren't working (`result` comes from an earlier crawl step that records the number of characters returned per URL; a sketch of it follows below):

    url_not_working = result[result$number_char == 0, 1]
    # url_not_working
    # [1] "inforapido.com.ar"           "canchallena.lanacion.com.ar" "fbapp://256002347743983/thread"
    # [4] "xnxx.com"                    "startappexchange.com"        "wv.inner-active.mobi"
    # [7] "livefyre.com"

    ### Using httr GET it works fine
    get_httr = GET(url_not_working[2])
    content(get_httr, 'text')

    # The result is the same when using a single getURL call
    get_rcurl = getURL(url_not_working[2], encoding = 'UTF-8', timeout = 2,
                       maxredirs = 3, verbose = TRUE, followLocation = TRUE)
    get_rcurl
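Since `result` is referenced above but never built in the snippet, here is a hypothetical reconstruction (my own sketch, not part of the original script) of how it could be produced. It assumes getURIAsynchronous returns one string per input URI, in input order:

    # Sketch only: rebuild `result` as a data frame of URI -> bytes received.
    content_by_chunk <- lapply(uris_ls, function(chunk)
      getURIAsynchronous(chunk,
                         .opts = list(timeout = 2, maxredirs = 3,
                                      followLocation = TRUE)))
    contents <- unlist(content_by_chunk, use.names = FALSE)
    # nchar(type = "bytes") avoids the multibyte errors that plain nchar()
    # raises on badly encoded responses
    result <- data.frame(uri         = unlist(uris_ls, use.names = FALSE),
                         number_char = nchar(contents, type = "bytes"),
                         stringsAsFactors = FALSE)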
Question:
Given the number of web pages I need to crawl, I would prefer to use RCurl for this, since it supports concurrent requests. I wonder whether it is possible to improve the call to getURIs() so that it works like the GET() version in the cases where the getURL()/getURIs() version fails.
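For what it's worth, one difference I can think of (an assumption on my part, not something established above) is that httr::GET always sends a User-Agent header while RCurl sends none unless told to, and some servers refuse agent-less requests. Forwarding an explicit user agent along with the other options to the multi-URL helpers is cheap to try:

    # Sketch, untested: mimic httr's defaults in the multi-URL interface.
    # "RCurl-crawler/0.1" is a made-up identifier -- substitute your own.
    opts <- list(useragent      = "RCurl-crawler/0.1",
                 timeout        = 2,
                 maxredirs      = 3,
                 followLocation = TRUE)
    uris_content <- lapply(uris_ls, function(chunk)
      getURIAsynchronous(chunk, .opts = opts))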
UPDATE:
I added a larger dataset (990 URIs) to reproduce the problem better.
    uris_ls <- dput() # dput() output found here: https:
After running:
    uris_content <- list()
    for(i in seq_along(uris_ls)){
      uris_content[[i]] <- getURIs(uris_ls[[i]])
    }
I get the following error:
    Error in curlMultiPerform(obj) : embedded nul in string: 'GIF89a\001'
    In addition: Warning message:
    In strsplit(str, "\\\r\\\n") : input string 1 is invalid in this locale
Using getURIAsynchronous:
    uris_content <- list()
    for(i in seq_along(uris_ls)){
      uris_content[[i]] <- getURIAsynchronous(uris_ls[[i]],
        .opts = list(timeout = 2, maxredirs = 3, verbose = TRUE,
                     followLocation = TRUE))
    }
I get a similar error:

    Error in nchar(str) : invalid multibyte string 1
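Both errors look like the same underlying problem: at least one host is answering with binary data (the 'GIF89a' prefix is a GIF image signature), and RCurl fails when it tries to treat those bytes as text in the current locale. A possible workaround, sketched here under the assumption that the documented `binary` argument of getURIAsynchronous makes it hand back raw vectors, is to download everything as raw bytes and convert to text only what decodes cleanly:

    # Sketch, untested: fetch as raw bytes, then decode defensively.
    raw_content <- getURIAsynchronous(uris_ls[[1]],
        binary = rep(TRUE, length(uris_ls[[1]])),
        .opts  = list(timeout = 2, maxredirs = 3, followLocation = TRUE))
    as_text <- lapply(raw_content, function(x) {
      if (!is.raw(x)) return(x)   # some elements may already be character
      x <- x[x != as.raw(0)]      # drop embedded nuls (e.g. image bodies)
      iconv(rawToChar(x), to = "UTF-8", sub = "byte")
    })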
UPDATE 2
    library(RCurl)

    uris_ls <- dput() # dput() output found here: https:
I tried the following:
    Sys.setlocale(locale = "C")

    uris_content <- list()
    for(i in seq_along(uris_ls)){
      uris_content[[i]] <- getURIAsynchronous(uris_ls[[i]],
        .opts = list(timeout = 2, maxredirs = 3, verbose = TRUE,
                     followLocation = TRUE))
    }
The result is that it works well for the first 225 URLs, after which it just returns zero-length content for every website. Is this an error?
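In the meantime, the workaround I plan to try (my own guess, unverified: firing roughly 330 simultaneous transfers per call may exhaust connections or file handles, so later transfers silently return nothing) is to crawl in smaller batches and retry empty results sequentially:

    # Sketch, untested: small batches plus one sequential retry per empty hit.
    # batch_size = 50 is an arbitrary guess, not a tuned value.
    crawl <- function(uris, batch_size = 50,
                      opts = list(timeout = 2, maxredirs = 3,
                                  followLocation = TRUE)) {
      batches <- split(uris, ceiling(seq_along(uris) / batch_size))
      out <- character(0)
      for (b in batches) {
        res <- getURIAsynchronous(b, .opts = opts)
        empty <- nchar(res, type = "bytes") == 0
        if (any(empty))
          res[empty] <- vapply(b[empty], function(u)
            tryCatch(getURL(u, .opts = opts), error = function(e) ""),
            character(1))
        out <- c(out, res)
      }
      out
    }

    uris_content <- crawl(unlist(uris_ls, use.names = FALSE))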