I make a lot of SQL queries into the industry strength database far, but it takes a long time to get the results. It was much faster when my R computer was almost next to the database, which led me to believe that the delay between my computer and the database was a bottleneck, and working with parallel queries could speed things up. We are located on different continents.
Here is a working version that is not parallel:
doQueries <- function(filenameX, inp1, inp2) { print(paste("Starting:", inp1, inp2, ",saving to", filenameX, sep=" ")) # Here should the query be (using RODBC) # save(queryresults, file="filenameX") } input.rows <- cbind(c("file1.rda","file2.rda","file3.rda"),c("A","B","C"),c(12,13,14)) for (i in 1:nrow(input.rows)) { doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3]) }
I tried with the following code, but the foreach library does not seem to be accessible, and, as I understand it, from reading on CRAN, parallel replaces earlier packages for parallelization (the "foreach" package is not available (for version R 2.15.0) ").
library(parallel) library(foreach) foreach (i=1:nrow(input.rows)) %dopar% { doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3]) }
How can I do it?
Thanks to all the members of Stackoverflow!
/ Chris
Update: thanks to the senates, I was able to load the libraries. The following code seems to work:
library(RODBC) library(doParallel) library(foreach) # odbcCloseAll() # my_conn <- odbcConnect("database", uid="xx", pwd="yy", case="nochange") doQueries <- function(filenameX, inp1, inp2) { print(paste("Starting:", inp1, inp2, ",saving to", filenameX, sep=" ")) # sql.test <- RODBC::sqlQuery(my_conn, "SELECT * FROM zzz LIMIT 100", rows_at_time=1024) # save(sql.test, file="filenameX") } input.rows <- cbind(c("file1.rda","file2.rda","file3.rda"),c("A","B","C"),c(12,13,14)) cl <- makeCluster(3) registerDoParallel(cl) foreach (i=1:nrow(input.rows)) %dopar% { doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3]) } stopCluster(cl)
But when I turn on the actual SQL query, this error message appears: Error in {: task 1 failed - "the first argument is not an open RODBC channel"
Could it be that this conceptually does not work? What RODBC cannot handle multiple requests at a time?
I really appreciate all the support.
/ Chris
Update 2: Thank you very much for the very nice and impressive answers. It's hard to judge if data is being transferred faster (I think that overall throughput is 20% faster), but I found that since requests (about 100) have different response times and require post-processing (which I include in the function before saving), I get the best use of communication and local processor together. That is, with one request at a time, during data transfer, the CPUs are almost not used, and then the connection will be quiet while the CPUs are working. With parallel requests, I see that the data is coming in and the processors are working at the same time. As a result, it became much faster. Many thanks!
/ Chris