I would like to assign many (up to 2000+) columns to a data.table; the task struck me as highly parallelizable, but it seems that spreading the same data.table across many workers is handled poorly.
I expected the following to work:
library(data.table)
library(parallel)
NN = 100
JJ = 100
cl = makeCluster(2)
DT = data.table(seq_len(NN))
alloc.col(DT, 1.5*JJ)
clusterExport(cl, c("DT", "NN", "JJ"))
clusterEvalQ(cl, library(data.table))
parLapply(cl, seq_len(JJ), function(jj) {
  set(DT, , paste0("V", jj), rnorm(NN))
})
stopCluster(cl)
However, this produces a cryptic error:
Error in checkForRemoteErrors(val) : 2 nodes produced errors; first error: Internal error, please report (including result of sessionInfo()) to datatable-help: oldtncol (0) < oldncol (1) but tl class is marked.
I guess this is due to how assignment by reference works. The assignment happens on each worker's own copy of the table, so it never makes it back to DT in the global environment.
Is adding columns in parallel simply not an option with data.table?
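In case it helps, the workaround I currently have in mind is sketched below (assuming the expensive part is generating each column's contents, stood in for here by rnorm): compute the values in parallel, then assign them by reference serially in the master process.

library(data.table)
library(parallel)

NN = 100
JJ = 100

cl = makeCluster(2)
clusterExport(cl, "NN")

# generate each column's contents in parallel; nothing is shared,
# each worker just returns a plain numeric vector to the master
new_cols = parLapply(cl, seq_len(JJ), function(jj) rnorm(NN))
stopCluster(cl)

# assign the pre-computed columns by reference, serially, in the master
DT = data.table(seq_len(NN))
alloc.col(DT, 1.5 * JJ)
for (jj in seq_len(JJ)) set(DT, NULL, paste0("V", jj), new_cols[[jj]])

This avoids any shared state between workers, at the cost of shipping the generated vectors back to the master, but it is not what I was hoping for.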