Parallel Loop on Windows

I make a lot of SQL queries into the industry strength database far, but it takes a long time to get the results. It was much faster when my R computer was almost next to the database, which led me to believe that the delay between my computer and the database was a bottleneck, and working with parallel queries could speed things up. We are located on different continents.

Here is a working version that is not parallel:

doQueries <- function(filenameX, inp1, inp2) { print(paste("Starting:", inp1, inp2, ",saving to", filenameX, sep=" ")) # Here should the query be (using RODBC) # save(queryresults, file="filenameX") } input.rows <- cbind(c("file1.rda","file2.rda","file3.rda"),c("A","B","C"),c(12,13,14)) for (i in 1:nrow(input.rows)) { doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3]) } 

I tried with the following code, but the foreach library does not seem to be accessible, and, as I understand it, from reading on CRAN, parallel replaces earlier packages for parallelization (the "foreach" package is not available (for version R 2.15.0) ").

 library(parallel) library(foreach) foreach (i=1:nrow(input.rows)) %dopar% { doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3]) } 

How can I do it?

Thanks to all the members of Stackoverflow!

/ Chris

Update: thanks to the senates, I was able to load the libraries. The following code seems to work:

 library(RODBC) library(doParallel) library(foreach) # odbcCloseAll() # my_conn <- odbcConnect("database", uid="xx", pwd="yy", case="nochange") doQueries <- function(filenameX, inp1, inp2) { print(paste("Starting:", inp1, inp2, ",saving to", filenameX, sep=" ")) # sql.test <- RODBC::sqlQuery(my_conn, "SELECT * FROM zzz LIMIT 100", rows_at_time=1024) # save(sql.test, file="filenameX") } input.rows <- cbind(c("file1.rda","file2.rda","file3.rda"),c("A","B","C"),c(12,13,14)) cl <- makeCluster(3) registerDoParallel(cl) foreach (i=1:nrow(input.rows)) %dopar% { doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3]) } stopCluster(cl) 

But when I turn on the actual SQL query, this error message appears: Error in {: task 1 failed - "the first argument is not an open RODBC channel"

Could it be that this conceptually does not work? What RODBC cannot handle multiple requests at a time?

I really appreciate all the support.

/ Chris

Update 2: Thank you very much for the very nice and impressive answers. It's hard to judge if data is being transferred faster (I think that overall throughput is 20% faster), but I found that since requests (about 100) have different response times and require post-processing (which I include in the function before saving), I get the best use of communication and local processor together. That is, with one request at a time, during data transfer, the CPUs are almost not used, and then the connection will be quiet while the CPUs are working. With parallel requests, I see that the data is coming in and the processors are working at the same time. As a result, it became much faster. Many thanks!

/ Chris

+6
source share
1 answer

As I mentioned in my comment, this method will probably not be faster. To answer your question, the foreach package is available for your version of R. Perhaps your selected repository has not yet been updated. Try the following:

 install.packages('foreach', repos='http://cran.us.r-project.org') 

which should install the package. If this does not work, take the binary code for your OS here and just install it via the menu.

If the bottleneck is the network connection, then you can only speed up the process by reducing the number of things you put on the network. One idea would be to remotely connect to a database server, dump a request for a file (on the server), compress it, and then download it to a computer, then R unpack and download it. It sounds a lot, but you can probably do the whole process within R.

Following your update, it seems like you haven't included the .packages argument in your foreach . That's why you needed a sqlQuery function prefix using RODBC:: . It is necessary to indicate which packets are required for the loop, because I think that it essentially starts a new R session for each node, and each session should be initialized with packets. Similarly, you cannot access my_conn because it was outside the loop, you need to create inside the loop so that each node has its own copy.

 library(RODBC) library(foreach) library(doParallel) setwd('C:/Users/x/Desktop') doQueries <- function(filenameX) { sql.text<-sqlQuery(my_conn, 'SELECT * FROM table;') save(sql.text, file=filenameX) } cl <- makeCluster(2) registerDoParallel(cl) foreach (i=1:2, .packages='RODBC') %dopar% { my_conn <- odbcConnect("db", uid="user", pwd="pass") doQueries(filenameX=paste('file_',i,sep='')) } 

But, as I mentioned, it probably won't be faster.

+7
source

Source: https://habr.com/ru/post/915818/


All Articles