Speeding up data tables.

Question

Speeding up data tables.

I am looking for a faster way to figure out such things. (the real database has 109 964 694 rows).

DT<-data.table(id=c(1,2,1,4,2,1,7,8,8,10),
               effect=c("A","A","B","B","B","C","C","C","A","D"),
               value=1)

I would like to have an id with effect =="A"ANDeffect == "C"

intersect(DT[effect=="A",(id)], DT[effect=="C",(id)])

Is there an even faster way to do this?

I would also like to have an identifier with effect =="A"AND effect =="C".. but not witheffect=="B"

setdiff(
  intersect(DT[effect=="A",(id)],DT[effect=="C",(id)]),
DT[effect=="B",(id)]
)

any better (faster) idea?

Hello

PS: I tried things like this:

DT[,effect=="diag_998" | effect=="diag_1900",by=id][,sum(V1)==2,by=id][V1==TRUE,(id)]

but he is very very slow ...

EDIT: thanks for your suggestion.

nothing works faster than V1 and A1:

system.time(V1<-intersect(DT[effect=="A",(id)],DT[effect=="C",(id)]))# 0.014 sec
system.time(V2<-DT[effect=="A" | effect=="C", unique(id[duplicated(id)])]) #17sec
system.time(V3<-DT[,list(cond=all(c("A","C") %in% effect)),by=id][cond==TRUE,id] ) #more than 1 min
system.time(V4<-DT[effect=="A" | effect=="C", .N[.N > 1], by = id]$id) # 17 sec
system.time(V5<-DT[, .GRP[sum(c("A", "C") %chin% effect)==2], id]$id) # 103 sec

system.time(V6<-DT[, .GRP[sum(c("A", "C") %in% effect)==2], id]$id)#more than 1 min
setkey(DT, effect)
system.time(V7<-DT[.(c("A", "C")), if (.N > 1) TRUE, by = id]$id)#0.19 sec



system.time(A1<-setdiff(intersect(DT[effect=="A",(id)],DT[effect=="C",(id)]),DT[effect=="D",(id)])) # 0.014 sec
system.time(A2<-DT[,list(cond=all(c("A","C") %in% effect) & (!"D" %in% effect)),by=id][cond==TRUE,id])#more than 1 min
system.time(A3<-DT[,list(cond=all(c("A","C") %chin% effect) & (!"D" %chin% effect)),by=id][cond==TRUE,id])
system.time(A4<-DT[.(c("A", "C", "D")), if (.N == 2 & !("D" %in% effect)) TRUE, by = id]$id)# 1sec

EDIT: another test thanks to @MichaelChirico

microbenchmark(times=50L,
+                bakal=intersect(DT[effect=="A", id], DT[effect=="C", id]),
+                bakal.keyed=intersect(DT["A", id], DT["C", id]),
+                rscr1=DT[effect %in% c("A","C"), unique(id[duplicated(id)])],
+                rscr1.keyed=DT[.(c("A","C")), unique(id[duplicated(id)])])
Unit: milliseconds
        expr       min        lq      mean    median        uq       max neval
       bakal 10.963171 11.003494 11.072844 11.019909 11.060331 12.641751    50
 bakal.keyed 10.738537 10.794715 10.878960 10.838630 10.892020 12.416713    50
       rscr1  9.504886  9.572026  9.662024  9.598426  9.645478 11.127047    50
 rscr1.keyed  9.013076  9.037370  9.065215  9.065669  9.083492  9.206366    50

+4

r data.table

Vincent guyader Oct 24 '15 at 17:18

source share

1 answer

MichaelChirico · Accepted Answer · 2015-10-25T19:20:07+0000

You seem to be on the right track; I can only offer a keyboard to speed it up.

, :

set.seed(10239)
NN<-1e6
DT<-data.table(id=sample(8e5,NN,T),
               effect=sample(LETTERS[1:4],NN,T),
               val=rnorm(NN),key="effect")

, keying, , : setkey(DT,effect); DT[.(c("A", "C")), if (.N > 1) TRUE, by = id]$id.

, , ( , , cond==TRUE (cond) @nicola effect=="A"|effect=="C" effect %in% c("A", C") @RichardScriven's):

library(microbenchmark)
microbenchmark(times=50L,
               bakal=intersect(DT[effect=="A", id], DT[effect=="C", id]),
               rscr1=DT[effect %in% c("A","C"), unique(id[duplicated(id)])],
               nicol=DT[,.(cond=all(c("A","C") %in% effect)), by=id][(cond), id],
               rscr2=DT[effect %in% c("A","C"), .N[.N > 1], by = id]$id,
               akrun=DT[, .GRP[sum(c("A", "C") %chin% effect) == 2], id]$id,
               mikec=DT[.(c("A", "C")), if (.N > 1) TRUE, by = id]$id)

:

Unit: milliseconds
  expr        min         lq       mean     median         uq        max neval   cld
 bakal   14.82926   15.18540   17.86200   15.56453   16.70924   64.99443    50 a    
 rscr1   13.41102   13.98252   20.11127   14.93054   18.02248   66.14476    50 a    
 nicol 1329.82013 1377.03884 1436.45650 1404.48956 1483.47944 1758.00831    50     e
 rscr2  260.54888  269.86605  294.05219  276.66802  310.76356  479.50419    50   c  
 akrun  997.43300 1075.17322 1103.06220 1095.08246 1118.16848 1360.80793    50    d 
 mikec  154.39418  158.90884  180.01096  163.32763  206.59246  235.45523    50  b

, data.table , :

set.seed(12039)
NN<-1e8
DT<-data.table(id=sample(8e5,NN,T),
               effect=sample(LETTERS[1:4],NN,T),
               val=rnorm(NN),key="effect")
microbenchmark(times=50L,
               bakal=intersect(DT[effect=="A", id], DT[effect=="C", id]),
               bakal.keyed=intersect(DT["A", id], DT["C", id]),
               rscr1=DT[effect %in% c("A","C"), unique(id[duplicated(id)])],
               rscr1.keyed=DT[.(c("A","C")), unique(id[duplicated(id)])])
Unit: seconds
        expr      min       lq     mean   median       uq      max neval cld
       bakal 3.772309 3.991414 4.395669 4.408319 4.681609 5.170224    50  a 
 bakal.keyed 3.708500 3.807447 4.289518 4.384870 4.653427 5.085429    50  a 
       rscr1 4.962940 5.210845 5.721636 5.707369 6.162103 6.779900    50   b
 rscr1.keyed 4.904702 5.117411 5.727848 5.807186 6.194990 6.975508    50   b

Speeding up data tables.

More articles: