How to optimize a scraper with getURL() in R

I am trying to scrape all the bills from two pages on the website of the French lower house of parliament. The pages cover 2002-2012 and hold fewer than 1,000 bills each.

To do this, I call getURL() inside this loop:

 b <- "http://www.assemblee-nationale.fr" # base l <- c("12","13") # legislature id lapply(l, FUN = function(x) { print(data <- paste(b, x, "documents/index-dossier.asp", sep = "/")) # scrape data <- getURL(data); data <- readLines(tc <- textConnection(data)); close(tc) data <- unlist(str_extract_all(data, "dossiers/[[:alnum:]_-]+.asp")) data <- paste(b, x, data, sep = "/") data <- getURL(data) write.table(data,file=n <- paste("raw_an",x,".txt",sep="")); str(n) }) 

Is there a way to optimize the getURL() calls here? I cannot seem to enable simultaneous downloads by passing the async = TRUE parameter, which gives me the same error every time:

 Error in function (type, msg, asError = TRUE) : Failed to connect to 0.0.0.12: No route to host 

Any ideas? Thanks!

2 answers

Try mclapply {multicore} instead of lapply.

"mclapply is a parallel version of lapply, it returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X." ( http://www.rforge.net/doc/packages/multicore/mclapply.html )

If that does not work, you can get better performance using the XML package. Functions like xmlTreeParse support an event-driven, asynchronous calling style.

"Note that xmlTreeParse allows a hybrid processing style that allows handlers to be applied to nodes in the tree because they are converted to R. This is an event-driven style or asynchronous call." ( http://www.inside-r.org/packages/cran/XML/docs/xmlEventParse )


Why use R? For large scraping jobs you are better off using something already designed for the task. I have had good results with DownThemAll, a browser add-on. Just tell it where to start, how deep to go, which patterns to follow, and where to dump the HTML.

Then use R to read data from HTML files.
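For example (assuming the add-on dumped its pages into a local an_dump/ folder; the folder name and the XPath query are illustrative), the saved files can then be parsed offline:

    library(XML)

    # list the downloaded pages and parse each one from disk
    files  <- list.files("an_dump", pattern = "\\.html?$|\\.asp$", full.names = TRUE)
    docs   <- lapply(files, htmlParse)
    # pull one field per page, e.g. the <title> of each dossier
    titles <- sapply(docs, function(d) xpathSApply(d, "//title", xmlValue)[1])
    head(titles)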

The benefits are huge: these add-ons are designed specifically for the task, so they will run multiple downloads (with concurrency controlled by you), they will send the correct headers so your next question will not be "how do I set the user agent string with RCurl?", and they can handle retries when some of the downloads fail, which they inevitably do.
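(For completeness, a minimal sketch of that side question: RCurl forwards libcurl options, so a user agent can be set directly in the getURL() call; the string used here is just an illustrative value.)

    library(RCurl)

    page <- getURL("http://www.assemblee-nationale.fr",
                   useragent      = "Mozilla/5.0 (compatible; my-scraper)", # illustrative value
                   followlocation = TRUE)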

Of course, the downside is that you cannot easily start this process automatically, in which case you might be better off with curl on the command line or another command-line mirroring utility.

Honestly, you have better things to do with your time than write site-scraping code in R...


Source: https://habr.com/ru/post/912710/

