R parallel: rbind in parallel to separate data.frames

The code below gives different results on Windows and Ubuntu. I understand this is due to the different parallel processing methods on each platform.

In summary:
I cannot insert/rbind data in parallel on Linux (mclapply, mcmapply), but I can on Windows.

Thanks to @Hong Ooi for pointing out that mclapply does not run in parallel on Windows, but the question still stands.

Note that there are no multiple inserts into the same data.frame; each insert is performed on its own separate data.frame.

    library(R6)
    library(parallel)

    # storage objects generator
    cl <- R6Class(
      classname = "cl",
      public = list(
        data = data.frame(NULL),
        initialize = function() invisible(self),
        insert = function(x) self$data <- rbind(self$data, x)
      )
    )

    N <- 4L # number of entities
    i <- setNames(seq_len(N), paste0("n", seq_len(N)))

    # random data.frames
    set.seed(1)
    ldt <- lapply(i, function(i)
      data.frame(replicate(sample(3:10, 1), sample(letters, 1e5, rep = TRUE))))

    # entity storage
    lcl1 <- lapply(i, function(i) cl$new())
    lcl2 <- lapply(i, function(i) cl$new())
    lcl3 <- lapply(i, function(i) cl$new())

    # insert data
    invisible({
      mclapply(names(i), FUN = function(n) lcl1[[n]]$insert(ldt[[n]]))
      mcmapply(FUN = function(dt, cl) cl$insert(dt), ldt, lcl2, SIMPLIFY = FALSE)
      lapply(names(i), FUN = function(n) lcl3[[n]]$insert(ldt[[n]]))
    })

    ### Windows
    sapply(lcl1, function(cl) nrow(cl$data)) # mclapply
    #     n1     n2     n3     n4
    # 100000 100000 100000 100000
    sapply(lcl2, function(cl) nrow(cl$data)) # mcmapply
    #     n1     n2     n3     n4
    # 100000 100000 100000 100000
    sapply(lcl3, function(cl) nrow(cl$data)) # lapply
    #     n1     n2     n3     n4
    # 100000 100000 100000 100000

    ### Unix
    sapply(lcl1, function(cl) nrow(cl$data)) # mclapply
    # n1 n2 n3 n4
    #  0  0  0  0
    sapply(lcl2, function(cl) nrow(cl$data)) # mcmapply
    # n1 n2 n3 n4
    #  0  0  0  0
    sapply(lcl3, function(cl) nrow(cl$data)) # lapply
    #     n1     n2     n3     n4
    # 100000 100000 100000 100000

And the question is:

How can I rbind in parallel into separate data.frames on a Linux platform?

PS: Out-of-memory storage such as SQLite is not an option in my case.

+6
2 answers

The problem is that mclapply and mcmapply are not intended to be used with functions that have side effects. Your function modifies the objects in the lists, but mclapply does not send the modified objects back to the master process: it only returns the values explicitly returned by the function. This means your changes are lost when the workers exit.
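For example, the same idea with mclapply itself (a minimal sketch, assuming a Unix-alike platform and reusing the cl generator and ldt list from the question):

    # Sketch: build and return the modified object instead of mutating in place
    res <- mclapply(names(i), function(n) {
      obj <- cl$new()       # fresh storage object created inside the worker
      obj$insert(ldt[[n]])  # the side effect stays local to the worker...
      obj                   # ...but the object itself is returned to the master
    })
    names(res) <- names(i)
    sapply(res, function(x) nrow(x$data)) # 100000 for each entity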

Normally, I would change the code so that it doesn't depend on side effects, and instead return the objects that were modified. Here is one way to do that with clusterApply, so that it also runs in parallel on Windows:

    library(R6)
    library(parallel)

    cl <- R6Class(
      classname = "cl",
      public = list(
        data = data.frame(NULL),
        initialize = function() invisible(self),
        insert = function(x) self$data <- rbind(self$data, x)))

    N <- 4L # number of entities
    i <- setNames(seq_len(N), paste0("n", seq_len(N)))
    set.seed(1)
    ldt <- lapply(i, function(i)
      data.frame(replicate(sample(3:10, 1), sample(letters, 1e5, rep = TRUE))))

    nw <- 3 # number of workers
    clust <- makePSOCKcluster(nw)
    idx <- splitIndices(length(i), nw)
    nameslist <- lapply(idx, function(iv) names(i)[iv])

    lcl4 <- do.call('c',
      clusterApply(clust, nameslist,
        function(nms, cl, ldt) {
          library(R6)
          lcl4 <- lapply(nms, function(n) cl$new())
          names(lcl4) <- nms
          lapply(nms, FUN = function(n) lcl4[[n]]$insert(ldt[[n]]))
          lcl4
        }, cl, ldt))
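For completeness, a quick check of the combined result (a sketch; with the same seed it should match the sequential run from the question — the cluster is deliberately left running, since the next snippet reuses it):

    sapply(lcl4, function(x) nrow(x$data))
    #     n1     n2     n3     n4
    # 100000 100000 100000 100000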

This approach does not work if you want to create the list of objects once and then modify the objects several times in parallel. That can also be done, but it requires persistent workers: you retrieve the modified objects from the workers only after all of the tasks have completed. Unfortunately, mclapply does not use persistent workers, so in this case you have to use cluster functions such as clusterApply. Here is one way to do it:

    # Initialize the cluster workers
    clusterEvalQ(clust, library(R6))
    clusterExport(clust, c('cl', 'ldt'))
    clusterApply(clust, nameslist, function(nms) {
      x <- lapply(nms, function(n) cl$new())
      names(x) <- nms
      assign('lcl4', x, pos = .GlobalEnv)
      NULL
    })

    # Insert data into lcl4 on each worker
    clusterApply(clust, nameslist, function(nms) {
      lapply(nms, FUN = function(n) lcl4[[n]]$insert(ldt[[n]]))
      NULL
    })

    # Concatenate lcl4 from each worker
    lcl4 <- do.call('c', clusterEvalQ(clust, lcl4))

This is very similar to the previous method, except that it splits the process into three stages: initializing the workers, executing the tasks, and retrieving the results. I also initialized the workers in a more conventional way, using clusterExport and clusterEvalQ.
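Since the workers are persistent, the same worker-side objects can be modified again in a later round before the results are fetched. A minimal sketch of a second insert round (each entity then holds two inserts, so the row counts double; stopCluster is added here for cleanup):

    # Second round: the lcl4 objects on the workers persist between calls
    clusterApply(clust, nameslist, function(nms) {
      lapply(nms, FUN = function(n) lcl4[[n]]$insert(ldt[[n]]))
      NULL
    })

    # Fetch the objects again; each entity now holds two inserts
    lcl4 <- do.call('c', clusterEvalQ(clust, lcl4))
    sapply(lcl4, function(x) nrow(x$data)) # 200000 per entity

    # Shut down the workers when done
    stopCluster(clust)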

+3

I think the Windows version of mclapply works because it delegates its job to lapply. Checking the timings, or watching the CPU core usage, confirms this. According to the R source, on Windows mclapply and mcmapply are replaced by sequential versions.
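For reference, the Windows stand-ins in the parallel package amount to roughly the following (a simplified paraphrase of the R sources, not the verbatim code):

    # Simplified paraphrase of parallel's Windows stand-in for mclapply:
    # it refuses mc.cores > 1 and simply calls lapply sequentially,
    # which is why the side effects survive on Windows.
    mclapply <- function(X, FUN, ..., mc.cores = 1L) {
      if (as.integer(mc.cores) > 1L)
        stop("'mc.cores' > 1 is not supported on Windows")
      lapply(X, FUN, ...)
    }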

So it seems the code only appears to be parallelized on Windows; what exactly is going on beyond that is not clear at the moment.
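A minimal timing check makes the difference visible (a sketch; the one-second sleep is a stand-in for real work):

    library(parallel)
    f <- function(x) { Sys.sleep(1); x }
    system.time(lapply(1:4, f))                  # ~4s on any platform
    system.time(mclapply(1:4, f, mc.cores = 4L)) # ~1s on Unix; errors on Windows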

0

Source: https://habr.com/ru/post/987739/
