Assign columns to a data.table in parallel

I would like to assign many (up to 2000+) columns to a data.table. The task struck me as embarrassingly parallelizable, but it seems it is not handled well by distributing the same data.table to many workers.

I expected the following to work:

library(data.table)
library(parallel)

NN = 100
JJ = 100

cl = makeCluster(2)
DT = data.table(seq_len(NN))
alloc.col(DT, 1.5*JJ)

clusterExport(cl, c("DT", "NN", "JJ"))
clusterEvalQ(cl, library(data.table))

parLapply(cl, seq_len(JJ), function(jj) {
  set(DT, , paste0("V", jj), rnorm(NN))
})

stopCluster(cl)

However, this produces an obscure error:

Error in checkForRemoteErrors(val) : 2 nodes produced errors; first error: Internal error, please report (including result of sessionInfo()) to datatable-help: oldtncol (0) < oldncol (1) but tl of class is marked.

I guess this is related to how assignment by reference works: the assignment happens within each worker, but the change is never communicated back to DT in the global environment.
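This can be checked directly. A minimal sketch (assuming a Unix-like system, since mcparallel forks the process) showing that a set() performed in a forked child never reaches the parent's copy:

```r
library(data.table)
library(parallel)

DT <- data.table(x = seq_len(5))

# Fork a child process; set() there modifies the child's
# copy-on-write copy of DT, not the parent's.
job <- mcparallel(set(DT, j = "y", value = rnorm(5)))
invisible(mccollect(job))

"y" %in% names(DT)  # FALSE in the parent: the child's change is lost
```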

Is there any way to assign columns to a data.table in parallel?


On Linux (Ubuntu 16.04) this works with fork-based parallelism (i.e. mclapply):

DT <- do.call("cbind",
              mclapply(seq_len(JJ), function(jj) {
                set(DT, , paste0("V", jj), rnorm(NN))
              }, mc.cores = detectCores()))


NN = 100000, JJ = 100:

   user  system elapsed
  1.172   2.756  41.707

NN = 100, JJ = 2000:

   user  system elapsed
  4.060  11.152  24.101

NN = 1000, JJ = 2000:

   user  system elapsed
  6.580  15.712 139.967

That is, the overhead is substantial — elapsed time far exceeds user time, reaching over 2 minutes in the last case. For comparison, creating the same data in a single call:

system.time(
  DT2 <- as.data.table(matrix(rnorm(NN*JJ), ncol = JJ))
)
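Given that overhead, a plain serial loop over set() is also worth comparing. A minimal sketch (timings will vary by machine):

```r
library(data.table)

NN <- 1000; JJ <- 2000
DT <- data.table(seq_len(NN))
alloc.col(DT, 1.5 * JJ)  # pre-allocate column slots, as in the question

# Serial assignment by reference: no copies of DT, no worker overhead
system.time(
  for (jj in seq_len(JJ)) set(DT, j = paste0("V", jj), value = rnorm(NN))
)

ncol(DT)  # 1 original column + JJ new ones
```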

Source: https://habr.com/ru/post/1668070/

