I work with two lists of data.frames and am currently running something similar to this (a simplified version of what I am doing):
    df1 <- data.frame("a", "a1", "L", "R", "b", "c", 1, 2, 3, 4)
    df2 <- data.frame("a", "a1", "L", "R", "b", "c", 4, 4, 4, 4, 4, 44)
    df3 <- data.frame(7, 7, 7, 7)
    df4 <- data.frame(5, 5, 5, 5, 9, 9)
    L1 <- list(df1, df2)
    L2 <- list(df3, df4)

    myfun <- function(x, y) {
      # x: numeric data frame from L2 (any number of rows)
      # y: one-row data frame from L1 (six label columns, then numeric columns)
      difa <- rowSums(abs(x[c(TRUE, FALSE)] - x[c(FALSE, TRUE)]))
      y_num <- as.numeric(y[-c(1:6)])
      difb <- sum(abs(y_num[c(TRUE, FALSE)] - y_num[c(FALSE, TRUE)]))
      diff <- difa + difb
      return(diff)
    }

    output1 <- mapply(myfun, x = L2, y = L1)
Both lists have the same number of data frames, and each data frame in one list corresponds to the data frame at the same position in the other list. Each data frame in L1 has exactly one row, while each data frame in L2 has a variable number of rows; hence the use of sum for the former and rowSums for the latter. The number of numeric columns also varies, but it is always the same for two corresponding data frames.
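To illustrate the point about row counts, here is a small sketch using a hypothetical two-row data frame `x2` (not part of my real data) paired with `df1`. It only assumes the structure described above: six label columns in the one-row data frame, followed by the same number of numeric columns as in its partner.

```r
# One-row data frame as in L1: six label columns, then four numeric columns
df1 <- data.frame("a", "a1", "L", "R", "b", "c", 1, 2, 3, 4)

# Hypothetical partner for L2 with two rows instead of one
x2 <- data.frame(p = c(7, 1), q = c(7, 2), r = c(7, 3), s = c(7, 5))

myfun <- function(x, y) {
  difa <- rowSums(abs(x[c(TRUE, FALSE)] - x[c(FALSE, TRUE)]))
  y_num <- as.numeric(y[-c(1:6)])
  difb <- sum(abs(y_num[c(TRUE, FALSE)] - y_num[c(FALSE, TRUE)]))
  difa + difb
}

# The numeric column counts of a corresponding pair must match
stopifnot(ncol(x2) == ncol(df1) - 6)

res <- myfun(x2, df1)
res  # one value per row of x2: 2 and 5
```

Because the result has one value per row of the L2 data frame, the output for a pair can have a different length from one pair to the next.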
I want to use parallel processing to speed up the calculation when each list contains 1-10 million items. I tried the following:
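    library(parallel)

    # Leave one core free, but always use at least one worker
    no_cores <- max(1, detectCores() - 1, na.rm = TRUE)
    # FORK is only available on unix-alikes; fall back to PSOCK elsewhere
    ptype <- if (.Platform$OS.type == "unix") "FORK" else "PSOCK"

    cl <- makeCluster(no_cores, type = ptype)
    clusterMap(cl, myfun, x = L2, y = L1)
    stopCluster(cl)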
However, because of the large amount of data I use, this quickly fills up the memory. I assume that is because the complete lists of data frames are loaded on every worker? I am new to parallel processing in R and have read that, for parallel functions that do not do it automatically, the data has to be split into chunks according to the number of available cores. So I tried the following, which does not work:
    library(parallel)

    # Leave one core free, but always use at least one worker
    no_cores <- max(1, detectCores() - 1, na.rm = TRUE)
    # FORK is only available on unix-alikes; fall back to PSOCK elsewhere
    ptype <- if (.Platform$OS.type == "unix") "FORK" else "PSOCK"

    cl <- makeCluster(no_cores, type = ptype)
    output1 <- clusterMap(cl, myfun,
                          x = split(L2, ceiling(seq_along(L2) / no_cores)),
                          y = split(L1, ceiling(seq_along(L1) / no_cores)))
    stopCluster(cl)
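As a side note on the chunking itself, here is a small sketch on toy numbers (10 items and a hypothetical 3 workers, both made up for illustration). It compares the `ceiling(seq_along(...) / no_cores)` grouping used above with `splitIndices()` from the parallel package. As far as I can tell, the former produces chunks of at most `no_cores` items each, not `no_cores` chunks.

```r
library(parallel)

n_items <- 10    # toy size, chosen only for illustration
n_workers <- 3   # hypothetical worker count

# Grouping used in the question: chunks of at most n_workers items each
by_size <- split(seq_len(n_items), ceiling(seq_len(n_items) / n_workers))

# parallel::splitIndices(): n_workers chunks of near-equal size
by_count <- splitIndices(n_items, n_workers)

length(by_size)   # 4 chunks
length(by_count)  # 3 chunks
```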
Can someone help a newbie? Most of the information I have read uses parApply / parLapply / etc. I was able to get mcmapply working, but since it relies on forking, which is not available on Windows, I cannot use it. My code has to work on both unix and Windows systems, which is why I test OS.type and only set the cluster type to FORK on unix.
UPDATE: I now think the chunked version works in the sense that it distributes the chunks to the different workers, but the data type does not play well with the binary operators inside the workers. The problem is that each chunk arrives as a list of data frames (a list of lists), which is treated as non-numeric on the workers.
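To make the update concrete, here is a minimal sketch of what I believe is happening, together with one possible workaround. `chunk_fun` is a hypothetical helper of my own, not part of the parallel package. The 2 workers, the 2 chunks, and the default PSOCK cluster type are assumptions chosen so that the example also runs on Windows.

```r
library(parallel)

# Toy data from the question
df1 <- data.frame("a", "a1", "L", "R", "b", "c", 1, 2, 3, 4)
df2 <- data.frame("a", "a1", "L", "R", "b", "c", 4, 4, 4, 4, 4, 44)
df3 <- data.frame(7, 7, 7, 7)
df4 <- data.frame(5, 5, 5, 5, 9, 9)
L1 <- list(df1, df2)
L2 <- list(df3, df4)

myfun <- function(x, y) {
  difa <- rowSums(abs(x[c(TRUE, FALSE)] - x[c(FALSE, TRUE)]))
  y_num <- as.numeric(y[-c(1:6)])
  difb <- sum(abs(y_num[c(TRUE, FALSE)] - y_num[c(FALSE, TRUE)]))
  difa + difb
}

# Split both lists with the same indices so corresponding pairs stay together
idx <- splitIndices(length(L1), 2)
x_chunks <- lapply(idx, function(i) L2[i])
y_chunks <- lapply(idx, function(i) L1[i])

# A chunk is a plain list of data frames, so myfun cannot take it directly
direct <- try(myfun(x_chunks[[1]], y_chunks[[1]]), silent = TRUE)
inherits(direct, "try-error")  # TRUE: the binary operator fails on lists

# Hypothetical wrapper: apply f to every pair inside one chunk
chunk_fun <- function(xs, ys, f) mapply(f, x = xs, y = ys, SIMPLIFY = FALSE)

cl <- makeCluster(2)  # default type is PSOCK
res <- clusterMap(cl, chunk_fun, xs = x_chunks, ys = y_chunks,
                  MoreArgs = list(f = myfun))
stopCluster(cl)

# Flatten the per-chunk results into one list, in the original order
output1 <- unlist(res, recursive = FALSE)
```

With the toy data this gives one result per pair (2 for the first pair and 40 for the second). I have not verified whether sending chunks this way actually reduces memory use on the real data.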