I wrote a function that uses RCurl to obtain the effective URL for a list of shortened, redirecting URLs (bit.ly, t.co, etc.) and to handle the errors that come up when the effective URL points at a document (PDFs tend to throw "Error in curlPerform ... embedded nul in string").

I would like to make this function more efficient if possible (while staying in R). As written, the run time is prohibitively long for de-shortening several thousand URLs.

?getURI tells us that by default, getURI/getURL goes asynchronous when the length of the url vector is > 1. But my performance looks completely linear, presumably because sapply turns the thing into one big for loop and the concurrency is lost.

Is there any way to speed up these requests? Extra credit for fixing the "embedded nul" issue.
require(RCurl)

options(RCurlOptions = list(verbose = FALSE, followlocation = TRUE,
                            timeout = 500, autoreferer = TRUE, nosignal = TRUE,
                            useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)"))
getEffectiveUrl <- function(url){
  c <- getCurlHandle()
  h <- basicHeaderGatherer()
  # collect response headers so the final status code can be inspected
  curlSetOpt(.opts = list(header = TRUE, verbose = FALSE), curl = c, .encoding = "CE_LATIN1")
  possibleError <- tryCatch(getURI(url, curl = c, followlocation = TRUE,
                                   headerfunction = h$update, async = TRUE),
                            error = function(e) e)
  if(inherits(possibleError, "error")){
    effectiveUrl <- "ERROR_IN_PAGE"
  } else {
    headers <- h$value()
    names(headers) <- tolower(names(headers))
    statusPrefix <- substr(headers[["status"]], 1, 1)
    if(statusPrefix == "2"){
      # 2xx: the handle now knows where the redirect chain ended up
      effectiveUrl <- getCurlInfo(c)[["effective.url"]]
    } else {
      effectiveUrl <- paste(headers[["status"]], headers[["statusmessage"]])
    }
  }
  effectiveUrl
}
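
For the extra-credit part, the only workaround I can think of (untested against the problem PDFs) is to make the request HEAD-style via libcurl's nobody option, so document bodies are never downloaded and their nul bytes never reach R. A standalone sketch of the idea:

# Untested idea: headers only (CURLOPT_NOBODY), so a PDF body -- and its embedded
# nul bytes -- is never fetched or converted to an R string; only the redirect
# chain and status get resolved. Some hosts may not answer HEAD requests properly.
ch <- getCurlHandle()
hg <- basicHeaderGatherer()
getURI("http://www.google.com", curl = ch, followlocation = TRUE, nobody = TRUE,
       headerfunction = hg$update)
getCurlInfo(ch)[["effective.url"]]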
testUrls <- c("http://t.co/eivRJJaV4j", "http://t.co/eFfVESXE2j", "http://t.co/dLI6Q0EMb0",
              "http://www.google.com", "http://1.uni.vi/01mvL", "http://t.co/05Mz00DHLD",
              "http://t.co/30aM6L4FhH", "http://www.amazon.com", "http://bit.ly/1fwWZLK",
              "http://t.co/cHglxQkz6Z")
system.time(
  effectiveUrls <- sapply(X = testUrls, FUN = getEffectiveUrl, USE.NAMES = FALSE)
)  # takes 7-10 secs on my laptop
# does Vectorize help?
vGetEffectiveUrl <- Vectorize(getEffectiveUrl, vectorize.args = "url")
system.time(
  effectiveUrls2 <- vGetEffectiveUrl(testUrls)
)  # nope, makes it worse
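
For comparison, handing the whole vector to getURI() in a single call should exercise the documented async/multi behaviour, but it returns the downloaded page bodies rather than the per-URL status and effective.url that I need, so it isn't a drop-in replacement:

# sanity check of the documented async behaviour: one call, whole vector;
# returns one character string of page content per URL, not the effective URLs
system.time(
  bodies <- getURI(testUrls, async = TRUE)
)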