I am looking for a faster way to figure out such things. (the real database has 109 964 694 rows).
DT<-data.table(id=c(1,2,1,4,2,1,7,8,8,10),
effect=c("A","A","B","B","B","C","C","C","A","D"),
value=1)
I would like to have an id with effect =="A"
ANDeffect == "C"
intersect(DT[effect=="A",(id)], DT[effect=="C",(id)])
Is there an even faster way to do this?
I would also like to have an identifier with effect =="A"
AND effect =="C"
.. but not witheffect=="B"
setdiff(
intersect(DT[effect=="A",(id)],DT[effect=="C",(id)]),
DT[effect=="B",(id)]
)
any better (faster) idea?
Hello
PS: I tried things like this:
DT[,effect=="diag_998" | effect=="diag_1900",by=id][,sum(V1)==2,by=id][V1==TRUE,(id)]
but he is very very slow ...
EDIT: thanks for your suggestion.
nothing works faster than V1 and A1:
system.time(V1<-intersect(DT[effect=="A",(id)],DT[effect=="C",(id)]))# 0.014 sec
system.time(V2<-DT[effect=="A" | effect=="C", unique(id[duplicated(id)])]) #17sec
system.time(V3<-DT[,list(cond=all(c("A","C") %in% effect)),by=id][cond==TRUE,id] ) #more than 1 min
system.time(V4<-DT[effect=="A" | effect=="C", .N[.N > 1], by = id]$id) # 17 sec
system.time(V5<-DT[, .GRP[sum(c("A", "C") %chin% effect)==2], id]$id) # 103 sec
system.time(V6<-DT[, .GRP[sum(c("A", "C") %in% effect)==2], id]$id)#more than 1 min
setkey(DT, effect)
system.time(V7<-DT[.(c("A", "C")), if (.N > 1) TRUE, by = id]$id)#0.19 sec
system.time(A1<-setdiff(intersect(DT[effect=="A",(id)],DT[effect=="C",(id)]),DT[effect=="D",(id)])) # 0.014 sec
system.time(A2<-DT[,list(cond=all(c("A","C") %in% effect) & (!"D" %in% effect)),by=id][cond==TRUE,id])#more than 1 min
system.time(A3<-DT[,list(cond=all(c("A","C") %chin% effect) & (!"D" %chin% effect)),by=id][cond==TRUE,id])
system.time(A4<-DT[.(c("A", "C", "D")), if (.N == 2 & !("D" %in% effect)) TRUE, by = id]$id)# 1sec
EDIT: another test thanks to @MichaelChirico
microbenchmark(times=50L,
+ bakal=intersect(DT[effect=="A", id], DT[effect=="C", id]),
+ bakal.keyed=intersect(DT["A", id], DT["C", id]),
+ rscr1=DT[effect %in% c("A","C"), unique(id[duplicated(id)])],
+ rscr1.keyed=DT[.(c("A","C")), unique(id[duplicated(id)])])
Unit: milliseconds
expr min lq mean median uq max neval
bakal 10.963171 11.003494 11.072844 11.019909 11.060331 12.641751 50
bakal.keyed 10.738537 10.794715 10.878960 10.838630 10.892020 12.416713 50
rscr1 9.504886 9.572026 9.662024 9.598426 9.645478 11.127047 50
rscr1.keyed 9.013076 9.037370 9.065215 9.065669 9.083492 9.206366 50