Assign columns to a data.table in parallel

I would like to assign many (up to 2000+) columns to a data.table. The task struck me as embarrassingly parallelizable, but it seems it is not handled well by distributing the same data.table to many workers.

I expected the following to work:

library(data.table)
library(parallel)

NN = 100
JJ = 100

cl = makeCluster(2)
DT = data.table(seq_len(NN))
alloc.col(DT, 1.5*JJ)

clusterExport(cl, c("DT", "NN", "JJ"))
clusterEvalQ(cl, library(data.table))

parLapply(cl, seq_len(JJ), function(jj) {
  set(DT, , paste0("V", jj), rnorm(NN))
})

stopCluster(cl)

However, this produces an obscure error:

Error in checkForRemoteErrors(val) : 2 nodes produced errors; first error: Internal error, please report (including result of sessionInfo()) to datatable-help: oldtncol (0) < oldncol (1) but tl of class is marked.

I guess this is related to how assignment by reference works: the assignment happens within each worker, but the change is never communicated back to DT in the global environment.
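This can be checked directly. A minimal sketch (assuming a Unix-like system, since mcparallel forks the process) showing that a set() performed in a forked child never reaches the parent's copy:

```r
library(data.table)
library(parallel)

DT <- data.table(x = seq_len(5))

# Fork a child process; set() there modifies the child's
# copy-on-write copy of DT, not the parent's.
job <- mcparallel(set(DT, j = "y", value = rnorm(5)))
invisible(mccollect(job))

"y" %in% names(DT)  # FALSE in the parent: the child's change is lost
```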

Is there any way to assign columns to a data.table in parallel?


On Linux (Ubuntu 16.04) this works with fork-based parallelism (i.e. mclapply):

DT <- do.call("cbind",
              mclapply(seq_len(JJ), function(jj) {
                set(DT, , paste0("V", jj), rnorm(NN))
              }, mc.cores = detectCores()))


NN = 100000, JJ = 100:

   user  system elapsed
  1.172   2.756  41.707

NN = 100, JJ = 2000:

   user  system elapsed
  4.060  11.152  24.101

NN = 1000, JJ = 2000:

   user  system elapsed
  6.580  15.712 139.967

That is, the overhead is substantial — elapsed time far exceeds user time, reaching over 2 minutes in the last case. For comparison, creating the same data in a single call:

system.time(
  DT2 <- as.data.table(matrix(rnorm(NN*JJ), ncol = JJ))
)
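Given that overhead, a plain serial loop over set() is also worth comparing. A minimal sketch (timings will vary by machine):

```r
library(data.table)

NN <- 1000; JJ <- 2000
DT <- data.table(seq_len(NN))
alloc.col(DT, 1.5 * JJ)  # pre-allocate column slots, as in the question

# Serial assignment by reference: no copies of DT, no worker overhead
system.time(
  for (jj in seq_len(JJ)) set(DT, j = paste0("V", jj), value = rnorm(NN))
)

ncol(DT)  # 1 original column + JJ new ones
```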

Source: https://habr.com/ru/post/1668070/

