Run an R function with multiple parameters in parallel

I have a function

function1 <- function(df1, df2, int1, int2, char1) { ... return(newDataFrame) } 

which has 5 inputs: the first two are data frames, then two integers and a string. The function returns a new data frame.

So far, I have called this function 8 times in a row:

 newDataFrame1 <- function1(df1, df2, 1, 1, "someString")
 newDataFrame2 <- function1(df1, df2, 2, 0, "someString")
 newDataFrame3 <- function1(df1, df2, 3, 0, "someString")
 newDataFrame4 <- function1(df1, df2, 4, 0, "someString")
 newDataFrame5 <- function1(df1, df2, 5, 0, "someString")
 newDataFrame6 <- function1(df1, df2, 6, 0, "someString")
 newDataFrame7 <- function1(df1, df2, 7, 0, "someString")
 newDataFrame8 <- function1(df1, df2, 8, 0, "someString")

and at the end I combine the results using rbind():

 newDataFrameTot <- rbind(newDataFrame1, newDataFrame2, newDataFrame3, newDataFrame4,
                          newDataFrame5, newDataFrame6, newDataFrame7, newDataFrame8)

I wanted to run this in parallel using the parallel package, but I cannot figure out how. I'm trying:

 cluster <- makeCluster(detectCores())
 result <- clusterApply(cluster, 1:8, function1)
 newDataFrameTot <- do.call(rbind, result)

but this only works if function1() takes a single parameter, which would iterate from 1 to 8. That is not my case, since I need to pass 5 inputs. How can I make this work in parallel?

3 answers

To iterate over more than one variable, clusterMap is very useful. Since you only vary int1 and int2, you should use the MoreArgs option to specify the variables that you do not iterate over:

 cluster <- makeCluster(detectCores())
 clusterEvalQ(cluster, library(xts))
 result <- clusterMap(cluster, function1, int1=1:8, int2=c(1, rep(0, 7)),
                      MoreArgs=list(df1=df1, df2=df2, char1="someString"))
 df <- do.call('rbind', result)

In particular, if df1 and df2 are data frames and they are specified as iteration variables instead of via MoreArgs, clusterMap will iterate over the columns of those data frames rather than passing each entire data frame to function1, which is not what you want.

Note that it is important to use named arguments so that they are matched correctly.
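The same matching rule can be checked serially with base R's mapply, which clusterMap mirrors. A minimal sketch with a hypothetical two-argument function f (not from the question):

 # f stands in for any function with two iterated inputs and one fixed input
 f <- function(x, y, tag) paste(tag, x + y)
 # Named iteration arguments plus MoreArgs for the fixed one:
 mapply(f, x = 1:3, y = c(10, 0, 0), MoreArgs = list(tag = "sum:"))
 # "sum: 11" "sum: 2"  "sum: 3"

Because the arguments are named, it is unambiguous which values vary per task and which stay constant; the clusterMap call above follows exactly the same pattern.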


Performance note

If df1 or df2 are very large, you may get better performance by exporting them to the workers. This avoids sending them with each task, but requires a wrapper function. It also means that you no longer need the MoreArgs option:

 clusterExport(cluster, c('df1', 'df2', 'function1'))
 wrapper <- function(int1, int2, char1) {
   function1(df1, df2, int1, int2, char1)
 }
 result <- clusterMap(cluster, wrapper, 1:8, c(1, rep(0, 7)), "someString")

This lets the workers reuse df1 and df2 when they execute several tasks, but it provides no benefit if the number of tasks equals the number of workers.
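Whichever variant you use, the cluster should be shut down once the results are collected. A sketch of the full lifecycle, assuming df1, df2, function1, and wrapper are defined as in the snippets above:

 library(parallel)

 cluster <- makeCluster(detectCores())
 clusterExport(cluster, c('df1', 'df2', 'function1'))
 result <- clusterMap(cluster, wrapper, 1:8, c(1, rep(0, 7)), "someString")
 stopCluster(cluster)  # release the worker processes

 newDataFrameTot <- do.call(rbind, result)

Forgetting stopCluster() leaves the worker R processes running until the master session exits.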


To iterate over a single variable, you would use a parallel version of lapply or sapply, as you tried. To iterate over several variables, however, you need a parallel version of mapply or Map. That is clusterMap, so try:

 clusterMap(cluster, function1, df1, df2, 1:8, c(1, rep(0, 7)), "someString") 

Edit As pointed out in the comments, this throws an error. Normally, arguments of length 1 (such as "someString" here) are recycled to the length of the others (such as 1:8 here). The error arises because data frames are not recycled that way: they are treated as lists, so their columns are recycled rather than the whole data frame. That is why you got the error $ operator is invalid for atomic vectors — inside function1, $ was applied to an extracted column of the data frame, which is a vector, not the data frame itself. There are two remedies. The first is to pass the extra arguments via MoreArgs, as the other answer shows; this requires your arguments to be named (which is good practice anyway). The second is to wrap each data frame in a list:

 clusterMap(cluster, function1, list(df1), list(df2), 1:8, c(1, rep(0, 7)), "someString") 

This works because now the whole data frames df1 and df2 are recycled. You can see the difference by comparing, e.g., the output of rep(df1, 2) with that of rep(list(df1), 2).
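To illustrate that difference concretely (using a small stand-in for df1, since the question's data is not shown):

 # Hypothetical stand-in for df1
 df1 <- data.frame(a = 1:3, b = 4:6)

 # A data frame is a list of columns, so rep() recycles the columns:
 length(rep(df1, 2))        # 4 -- the columns a, b, a, b
 # Wrapping it in a list recycles the whole data frame:
 length(rep(list(df1), 2))  # 2 -- two complete copies of df1

The list-wrapped form is what clusterMap needs: each task should receive a complete data frame, not one of its columns.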


Since I recently ran into the same problem in R, I am adding a link to a very useful site about the new multidplyr package, which enables parallel processing in R. It definitely works on Windows 10. :)

http://www.business-science.io/code-tools/2016/12/18/multidplyr.html

To help with your code, this is the solution I would suggest (not tested, but it should work, as I have used the same pattern in another example):

 # Install the packages
 install.packages("devtools")
 devtools::install_github("hadley/multidplyr")
 require(multidplyr)
 library(parallel)

 cl <- detectCores()
 cluster <- create_cluster(cores = cl)

 cluster %>%
   # Assign libraries
   cluster_library("igraph") %>%
   cluster_library("tidyverse") %>%
   cluster_library("magrittr") %>%
   cluster_library("dplyr") %>%
   cluster_library("RColorBrewer") %>%
   # Assign values (use this to load functions or data to each core)
   cluster_assign_value("anyfunction", anyfunction)

 result <- clusterMap(cluster, function1, int1=1:8, int2=c(1, rep(0, 7)),
                      MoreArgs=list(df1=df1, df2=df2, char1="someString"))
