Make concurrent RCurl GET requests for a set of URLs

I wrote a function that uses RCurl to get the effective URL for a list of shortened, redirecting URLs (bit.ly, t.co, etc.) and to handle errors when the effective URL points to a document (PDFs usually throw "Error in curlPerform ... embedded nul in string").

I would like to make this function more efficient if possible (while staying in R). As written, the runtime is far too long to resolve several thousand URLs.

?getURI tells us that by default getURI/getURL runs asynchronously when the length of the URL vector is > 1. But my performance looks completely linear, presumably because sapply turns the whole thing into one big loop and the concurrency is lost.

Is there a way to speed up these queries? Extra credit for fixing the "embedded nul" issue.

require(RCurl)

options(RCurlOptions = list(verbose = F, followlocation = T,
                        timeout = 500, autoreferer = T, nosignal = T,
                        useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)"))

# find successful location (or error msg) after any redirects
getEffectiveUrl <- function(url){ 
  c = getCurlHandle()
  h = basicHeaderGatherer()
  curlSetOpt( .opts = list(header=T, verbose=F), curl= c, .encoding = "CE_LATIN1")
  possibleError <- tryCatch(getURI( url, curl=c, followlocation=T, 
                                    headerfunction = h$update, async=T),
                            error=function(e) e)  
  if(inherits(possibleError, "error")){
    effectiveUrl <- "ERROR_IN_PAGE" # fails on linked documents (PDFs etc.)
  } else { 
    headers <- h$value()
    names(headers) <- tolower(names(headers)) #sometimes cases change on header names?
    statusPrefix <- substr(headers[["status"]],1,1) #1st digit of http status
    if(statusPrefix=="2"){ # status = success
      effectiveUrl <- getCurlInfo(c)[["effective.url"]]
    } else{ effectiveUrl <- paste(headers[["status"]] ,headers[["statusmessage"]]) } 
  }
  effectiveUrl
}

testUrls <- c("http://t.co/eivRJJaV4j","http://t.co/eFfVESXE2j","http://t.co/dLI6Q0EMb0",
              "http://www.google.com","http://1.uni.vi/01mvL","http://t.co/05Mz00DHLD",
              "http://t.co/30aM6L4FhH","http://www.amazon.com","http://bit.ly/1fwWZLK",
              "http://t.co/cHglxQkz6Z") # 10th URL redirects to content w/ embedded nul
system.time(
  effectiveUrls <- sapply(X= testUrls, FUN=getEffectiveUrl, USE.NAMES=F)
) # takes 7-10 secs on my laptop

# does Vectorize help? 
vGetEffectiveUrl <- Vectorize(getEffectiveUrl, vectorize.args = "url")
system.time(
  effectiveUrls2 <- vGetEffectiveUrl(testUrls)
) # nope, makes it worse
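
Just to illustrate the ?getURI point above, a quick way to see the difference (timings vary by machine, and this fetches page bodies rather than effective URLs) is to compare a single vectorized call against the sapply() loop:

# sanity check, not a fix: one vectorized (async) call vs. an sapply() loop;
# the embedded-nul URL (#10) is dropped so both versions run to completion
system.time(bodiesAsync <- getURI(testUrls[-10], async = TRUE))
system.time(bodiesLoop  <- sapply(testUrls[-10], getURI))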
1 answer

I've had bad experiences with RCurl and async requests: R would freeze completely (with no error messages and no CPU or RAM spike) at only ~20 simultaneous requests.

I recommend switching to the curl package and its curl_fetch_multi() function. In my case it easily handled 50,000 JSON requests in one pool (with some splitting into sub-pools under the hood): https://cran.r-project.org/web/packages/curl/vignettes/intro.html#async_requests
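
Here is a minimal sketch of that approach, adapted to the effective-URL problem from the question (the wrapper name getEffectiveUrls and the 30-second timeout are my own choices, not part of the curl package): every URL is queued with curl_fetch_multi(), the done/fail callbacks record the final URL, the HTTP status, or the error message, and multi_run() then performs all transfers concurrently.

library(curl)

getEffectiveUrls <- function(urls) {
  results <- rep(NA_character_, length(urls))
  for (i in seq_along(urls)) {
    local({
      idx <- i                                   # capture the index for the callbacks
      h <- new_handle(followlocation = TRUE, timeout = 30)
      curl_fetch_multi(
        urls[idx], handle = h,
        done = function(res) {
          # res$url is the URL after any redirects; res$content stays a raw
          # vector, so the embedded-nul string error from RCurl can't occur
          if (res$status_code < 300) {
            results[idx] <<- res$url
          } else {
            results[idx] <<- paste("HTTP", res$status_code)
          }
        },
        fail = function(msg) results[idx] <<- paste("ERROR:", msg)
      )
    })
  }
  multi_run()   # run all queued requests concurrently on the default pool
  results
}

# effectiveUrls3 <- getEffectiveUrls(testUrls)

If you need to throttle concurrency, you can create a pool with new_pool() (it has total_con and host_con limits) and pass it to both curl_fetch_multi() and multi_run().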

