(Efficiently) merging random subsets

I have two data.tables; I would like to assign to each row of one a randomly chosen element of the other, drawn from among the rows whose keys match. The way I'm doing it now is pretty slow.

Let's make this concrete; here are some sample data:

 library(data.table)

 dt1 <- data.table(id = sample(letters[1:5], 500, replace = TRUE),
                   var1 = rnorm(500), key = "id")
 dt2 <- data.table(id = c(rep("a",4), rep("b",8), rep("c",2), rep("d",5), rep("e",7)),
                   place = paste(sample(c("Park","Pool","Rec Center","Library"), 26, replace = TRUE),
                                 sample(26)),
                   key = "id")

I want to add two randomly selected places to dt1 for each observation, where place must match id.

Here is what I am doing now:

 get_place <- function(xx) sapply(xx, function(x) dt2[.(x), sample(place, 1)])
 dt1[, paste0("place", 1:2) := list(get_place(id), get_place(id))]

This works, but it is rather slow: it took 66 seconds to run on my machine.

One problem seems to be that I cannot take proper advantage of the key:

Something like dt2[.(dt1$id), mult="random"] would be ideal, but no such option exists.

Any suggestions?

3 answers

Simple answer

 dt2[.(dt1),
     as.list(c(place = sample(place, size = 2, replace = TRUE))),
     by = .EACHI, allow.cartesian = TRUE]

This approach is simple and illustrates data.table features such as Cartesian joins and by=.EACHI, but it is very slow, because for each row of dt1 it (i) draws a sample and (ii) coerces the result to a list.
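The simple answer also returns a new table rather than modifying dt1. A minimal sketch of attaching its columns back, assuming (as the construct above implies) that by=.EACHI yields one result row per row of dt1, in order, and that c(place = ...) auto-names the columns place1 and place2:

 res <- dt2[.(dt1),
            as.list(c(place = sample(place, size = 2, replace = TRUE))),
            by = .EACHI, allow.cartesian = TRUE]
 # one result row per row of dt1, so the columns can be assigned back by position
 dt1[, paste0("place", 1:2) := res[, .(place1, place2)]]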

Fast answer

 nsamp <- 2
 dt3 <- dt2[.(unique(dt1$id)), list(i0 = .I[1] - 1L, .N), by = .EACHI]
 dt1[.(dt3), paste0("place", 1:nsamp) :=
       replicate(nsamp, dt2$place[i0 + sample(N, .N, replace = TRUE)], simplify = FALSE),
     by = .EACHI]
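The trick: dt2 is keyed on id, so each group's rows are contiguous, and dt3 stores each group's offset i0 (the position just before its first row in dt2) together with its size N. Each draw then becomes plain integer indexing of dt2$place via i0 + sample(N, ...). On the question's sample data, dt3 comes out as:

 dt3
 #    id i0 N
 # 1:  a  0 4
 # 2:  b  4 8
 # 3:  c 12 2
 # 4:  d 14 5
 # 5:  e 19 7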

Using replicate with simplify=FALSE (as in @bgoldst's answer) makes the most sense:

  • It returns a list of vectors, which is the format data.table requires when creating multiple new columns at once (see the sketch after this list).
  • replicate is the standard R function for repeated simulations.
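A minimal illustration of that shape (toy values; output varies with the RNG):

 replicate(2, sample(letters[1:3], 4, replace = TRUE), simplify = FALSE)
 # a list of two character vectors, e.g.
 # [[1]] "a" "c" "a" "b"
 # [[2]] "b" "b" "a" "c"
 # which is exactly what := expects when assigning two columns at once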

Benchmarks. We want each candidate as a function, and none of them should modify dt1 as we go:

 library(microbenchmark)

 # candidate functions (nsamp <- 2, as defined above)
 frank2 <- function(){
   dt3 <- dt2[.(unique(dt1$id)), list(i0 = .I[1] - 1L, .N), by = .EACHI]
   dt1[.(dt3),
       replicate(nsamp, dt2$place[i0 + sample(N, .N, replace = TRUE)], simplify = FALSE),
       by = .EACHI]
 }
 david2 <- function(){
   indx <- dt1[, .N, id]
   sim <- dt2[.(indx),
              replicate(2, sample(place, size = N, replace = TRUE), simplify = FALSE),
              by = .EACHI]
   dt1[, sim[, -1, with = FALSE]]
 }
 bgoldst <- function(){
   dt1[, replicate(2,
                   ave(id, id, FUN = function(x)
                     sample(dt2$place[dt2$id == x[1]], length(x), replace = TRUE)),
                   simplify = FALSE)]
 }

 # simulation
 size <- 1e6
 nids <- 1e3
 npls <- 2:15
 dt1 <- data.table(id = sample(1:nids, size = size, replace = TRUE),
                   var1 = rnorm(size), key = "id")
 dt2 <- unique(dt1)[, list(place = sample(letters, sample(npls, 1), replace = TRUE)), by = id]

 # benchmarking
 res <- microbenchmark(frank2(), david2(), bgoldst(), times = 10)
 print(res, order = "cld", unit = "relative")

which gives

 Unit: relative
       expr      min       lq     mean   median       uq      max neval cld
  bgoldst() 8.246783 8.280276 7.090995 7.142832 6.579406 5.692655    10   b
   frank2() 1.042862 1.107311 1.074722 1.152977 1.092632 0.931651    10  a
   david2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a

And if we switch the parameters ...

 # new simulation
 size <- 1e4
 nids <- 10
 npls <- 1e6:2e6
 dt1 <- data.table(id = sample(1:nids, size = size, replace = TRUE),
                   var1 = rnorm(size), key = "id")
 dt2 <- unique(dt1)[, list(place = sample(letters, sample(npls, 1), replace = TRUE)), by = id]

 # new benchmarking
 res <- microbenchmark(frank2(), david2(), times = 10)
 print(res, order = "cld", unit = "relative")

we see that

 Unit: relative
      expr    min     lq     mean   median       uq     max neval cld
  david2() 3.3008 3.2842 3.274905 3.286772 3.280362 3.10868    10   b
  frank2() 1.0000 1.0000 1.000000 1.000000 1.000000 1.00000    10  a

As one would expect, which approach is faster, collapsing dt1 in david2 or collapsing dt2 in frank2, depends on how much information the collapsing compresses away.


The ideal function for this purpose is ave(), because it runs a function over each group of a vector and automatically maps the return values back onto the elements of the group.
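For intuition, a minimal standalone sketch of ave() on a toy vector (hypothetical values):

 x <- c("a", "a", "b", "b", "b")
 # ave() runs FUN on each group's elements and writes the results
 # back into the matching positions of the original vector
 ave(x, x, FUN = function(g) paste0(g, "-", seq_along(g)))
 # [1] "a-1" "a-2" "b-1" "b-2" "b-3"

Applied to the question's data: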

 set.seed(1);
 dt1 <- data.table(id = sample(letters[1:5], 500, replace = TRUE),
                   var1 = rnorm(500), key = 'id');
 dt2 <- data.table(id = c(rep('a',4), rep('b',8), rep('c',2), rep('d',5), rep('e',7)),
                   place = paste(sample(c('Park','Pool','Rec Center','Library'), 26, replace = TRUE),
                                 sample(26)),
                   key = 'id');
 dt1[, paste0('place', 1:2) := replicate(2,
   ave(id, id, FUN = function(x)
     sample(dt2$place[dt2$id == x[1]], length(x), replace = TRUE)),
   simplify = FALSE)]
 dt1;
 ##      id       var1        place1        place2
 ##   1:  a -0.4252677 Rec Center 23       Park 12
 ##   2:  a -0.3892372       Park 12    Library 22
 ##   3:  a  2.6491669       Park 14 Rec Center 23
 ##   4:  a -2.2891240 Rec Center 23       Park 14
 ##   5:  a -0.7012317    Library 22       Park 12
 ##  ---
 ## 496:  e -1.0624084    Library 16    Library 16
 ## 497:  e -0.9838209     Library 4    Library 26
 ## 498:  e  1.1948510    Library 26       Pool 21
 ## 499:  e -1.3353714       Pool 18    Library 26
 ## 500:  e  1.8017255       Park 20       Pool 21

This should work with data.frames as well as data.tables.


Edit: adding benchmarking

This solution appears to be the fastest, at least after applying the correction suggested by Frank below.

 frank <- function(){
   dt2[.(dt1),
       as.list(c(place = sample(place, size = 2, replace = TRUE))),
       by = .EACHI, allow.cartesian = TRUE]
 }
 david <- function(){
   dt1[, paste0("place", 1:2) := lapply(1:2, function(x) get_place(id, .N)), by = id]
 }
 bgoldst <- function(){
   dt1[, paste0("place", 1:2) := replicate(2,
     ave(id, id, FUN = function(x)
       sample(dt2$place[dt2$id == x[1]], length(x), replace = TRUE)),
     simplify = FALSE)]
 }

 microbenchmark(times = 1000L, frank(), david(), bgoldst())

 Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval cld
    frank() 5.125843 5.353918 6.276879 5.496042 5.772051 15.57155  1000  b
    david() 6.049172 6.305768 7.172360 6.455687 6.669202 93.06398  1000   c
  bgoldst() 1.421330 1.521046 1.847821 1.570573 1.628424 89.60315  1000 a

By running sapply over each row, you are basically not using data.table at all. Instead, you can combine a binary join with the by argument so that sampling happens only once per id. You can define get_place as follows:

 get_place <- function(tempid, N) dt2[.(tempid), sample(place, N, replace = TRUE)] 

Then just do

 dt1[, place1 := get_place(id, .N), by = id] 
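With by = id, j is evaluated once per group rather than once per row, so get_place performs a single keyed lookup per distinct id, and .N supplies each group's size. A quick way to see what each call receives (on the question's data):

 # one row per distinct id; N is the group size passed to get_place via .N
 dt1[, .N, by = id]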

Or a general solution would be

 indx <- 1:2
 dt1[, paste0("place", indx) := lapply(indx, function(x) get_place(id, .N)), by = id]

Here's a benchmark on a somewhat larger dt1:

 size = 1e6
 set.seed(123)
 dt1 <- data.table(id = sample(letters[1:5], size, replace = TRUE),
                   var1 = rnorm(size), key = "id")

Using the same functions as defined in @bgoldst's answer

 microbenchmark(times = 10L, frank(), david(), bgoldst())
 # Unit: milliseconds
 #       expr         min         lq       mean     median         uq        max neval
 #    frank() 11627.68324 11771.4227 11887.1232 11804.6342 12012.4636 12238.1031    10
 #    david()    84.62109   122.1117   121.1003   123.5861   128.0042   132.3591    10
 #  bgoldst()   372.02267   400.8867   445.6231   421.3168   445.9076   709.5458    10

Here is another, faster take on the same idea (as shown in @Frank's benchmark):

 indx <- dt1[, .N, id]
 sim <- dt2[.(indx),
            replicate(2, sample(place, size = N, replace = TRUE), simplify = FALSE),
            by = .EACHI]
 dt1[, paste0("place", 1:2) := `[.listof`(sim, -1)]
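Here `[.listof`(sim, -1) treats sim as a plain list and drops its first element (the id join column), leaving just the two sampled vectors for := to assign. A hedged, arguably more readable equivalent is the form used in david2 above:

 dt1[, paste0("place", 1:2) := sim[, -1, with = FALSE]]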

Source: https://habr.com/ru/post/987599/

