Simple answer
    dt2[.(dt1), as.list(c(place = sample(place, size = 2, replace = TRUE))),
        by = .EACHI, allow.cartesian = TRUE]
This approach is simple and illustrates data.table features such as Cartesian joins and by=.EACHI, but it is very slow because, for each row of dt1, it (i) samples and (ii) coerces the result to a list.
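To make the per-row cost concrete, here is a minimal sketch on toy data. The two small tables and the names counts and draws are assumptions for illustration, not the tables from the question. It shows that by=.EACHI evaluates j once for every row of i, even when ids repeat, which is why the simple answer samples and builds a list once per row of dt1.

```r
library(data.table)

# Hypothetical toy data, not the tables from the original question.
dt1 <- data.table(id = c(1L, 1L, 2L), var1 = c(0.1, 0.2, 0.3), key = "id")
dt2 <- data.table(id = c(1L, 1L, 2L), place = c("a", "b", "z"), key = "id")

# by = .EACHI evaluates j once per row of i, so .N is the number of dt2 rows
# matched by that single row of i (repeated ids are processed again).
counts <- dt2[.(dt1$id), .(n_matched = .N), by = .EACHI]

# Same pattern as the simple answer: two draws per row of dt1.
set.seed(1)
draws <- dt2[.(dt1$id),
             as.list(c(place = sample(place, size = 2, replace = TRUE))),
             by = .EACHI, allow.cartesian = TRUE]

print(counts)
print(draws)
```

With many rows in dt1, j runs that many times, so the overhead of each small call adds up.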
Fast answer
    nsamp <- 2
    dt3 <- dt2[.(unique(dt1$id)), list(i0 = .I[1] - 1L, .N), by = .EACHI]
    dt1[.(dt3), paste0("place", 1:nsamp) :=
      replicate(nsamp, dt2$place[i0 + sample(N, .N, replace = TRUE)], simplify = FALSE),
      by = .EACHI]
Using replicate with simplify=FALSE (as in @bgoldst's answer) makes the most sense:
- It returns a list of vectors, which is the format data.table expects when creating new columns.
- replicate is the standard R function for repeated simulations.
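As a small illustration of the first point, the sketch below compares replicate with and without simplification and then assigns the list result as new columns. The toy table dt and the place labels are assumptions made up for this example.

```r
library(data.table)

set.seed(42)
places <- c("a", "b", "c")  # hypothetical place labels

# simplify = FALSE keeps one vector per replication: a list of length 2.
draws <- replicate(2, sample(places, size = 4, replace = TRUE), simplify = FALSE)

# With simplification the same call returns a 4 x 2 character matrix instead.
draws_mat <- replicate(2, sample(places, size = 4, replace = TRUE), simplify = TRUE)

# A list of equal-length vectors can be assigned directly as new columns.
dt <- data.table(id = 1:4)
dt[, paste0("place", 1:2) := draws]
print(dt)
```

Each list element becomes one column, so the number of replications has to match the number of column names on the left-hand side of `:=`.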
Benchmarks. We should compare several candidate functions, and avoid modifying dt1 as we go:
    # candidate functions
    frank2 <- function(){
      dt3 <- dt2[.(unique(dt1$id)), list(i0 = .I[1] - 1L, .N), by = .EACHI]
      dt1[.(dt3),
        replicate(nsamp, dt2$place[i0 + sample(N, .N, replace = TRUE)], simplify = FALSE),
        by = .EACHI]
    }
    david2 <- function(){
      indx <- dt1[, .N, id]
      sim <- dt2[.(indx),
        replicate(2, sample(place, size = N, replace = TRUE), simplify = FALSE),
        by = .EACHI]
      dt1[, sim[, -1, with = FALSE]]
    }
    bgoldst <- function(){
      dt1[,
        replicate(2, ave(id, id, FUN = function(x)
          sample(dt2$place[dt2$id == x[1]], length(x), replace = T)), simplify = F)
      ]
    }
which gives
    Unit: relative
          expr      min       lq     mean   median       uq      max neval cld
     bgoldst() 8.246783 8.280276 7.090995 7.142832 6.579406 5.692655    10   b
      frank2() 1.042862 1.107311 1.074722 1.152977 1.092632 0.931651    10  a
      david2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a
And if we switch the parameters ...
    # new simulation
    size <- 1e4
    nids <- 10
    npls <- 1e6:2e6

    dt1 <- data.table(id = sample(1:nids, size = size, replace = TRUE),
                      var1 = rnorm(size), key = "id")
    dt2 <- unique(dt1)[, list(place = sample(letters, sample(npls, 1), replace = TRUE)),
                       by = id]

    # new benchmarking
    library(microbenchmark)
    res <- microbenchmark(frank2(), david2(), times = 10)
    print(res, order = "cld", unit = "relative")
we see that
    Unit: relative
         expr    min     lq     mean   median       uq     max neval cld
     david2() 3.3008 3.2842 3.274905 3.286772 3.280362 3.10868    10   b
     frank2() 1.0000 1.0000 1.000000 1.000000 1.000000 1.00000    10  a
As one might expect, which way is faster (collapsing dt1 in david2 or collapsing dt2 in frank2) depends on how much information is compressed by collapsing.
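The two kinds of collapsing can be sketched on toy tables as follows. The data and the names n_draws, grp and picked are assumptions for illustration. For brevity the sketch groups with a plain by = id rather than the keyed join with by=.EACHI used in the functions above; on these toy tables that should give the same per-id summaries.

```r
library(data.table)

# Hypothetical toy tables; both are keyed (sorted) by id.
dt1 <- data.table(id = c(1L, 2L, 2L, 2L, 2L), key = "id")
dt2 <- data.table(id = c(1L, 1L, 1L, 2L, 2L),
                  place = c("a", "b", "c", "x", "y"), key = "id")

# david2-style: collapse dt1 to one row per id with the number of draws needed.
n_draws <- dt1[, .N, by = id]

# frank2-style: collapse dt2 to one row per id, storing i0 (the row index just
# before the id's first row in dt2) and N (how many places that id has).
dt3 <- dt2[, .(i0 = .I[1L] - 1L, N = .N), by = id]

# Draw 4 places for id 2 by offsetting sampled positions into dt2$place.
set.seed(1)
grp <- dt3[id == 2L]
picked <- dt2$place[grp$i0 + sample(grp$N, 4, replace = TRUE)]

print(n_draws)
print(dt3)
print(picked)
```

When dt1 has many rows per id, collapsing dt1 removes a lot of work. When dt2 has many places per id, as in the second simulation, collapsing dt2 to two integers per id is the bigger saving.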