Resampling by subject ID in R

Question


Suppose we have the following data:

    set.seed(123)
    dat <- data.frame(var1=c(10,35,13,19,15,20,19), id=c(1,1,2,2,2,3,4))
    (sampledIDs <- sample(min(dat$id):max(dat$id), size=3, replace=TRUE))
    # [1] 2 4 2

sampledIDs is a vector of IDs sampled (with replacement) from dat$id. I need code that produces the following result (and that also works for a large dataset with many variables):

      var1 id
        13  2
        19  2
        15  2
        19  4
        13  2
        19  2
        15  2

The code dat[which(dat$id%in%sampledIDs),] does not give me what I want, since its result is

      var1 id
        13  2
        19  2
        15  2
        19  4

in which the subject with dat$id==2 appears only once (I understand why this is the result, but I don't know how to get what I want). Can anybody help?

EDIT: Thanks for the answers. Here are the run times for all the answers (for those who are interested):

      test                                                                replications elapsed relative user.self
    3 dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]           1000    0.67    1.000      0.64
    1 dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[, 1], ]         1000    0.67    1.000      0.67
    2 do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])                1000    1.83    2.731      1.83
    4 setkey(setDT(dat), id)[J(sampledIDs)]                                       1000    1.33    1.985      1.33

+6

matrix r sampling resampling

Giuseppe Nov 03 '14 at 11:21


3 answers

You can do:

 do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
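A minimal sketch of why this works, using the question's dat. Here sampledIDs is hard-coded to the draw shown in the question (an assumption made for reproducibility, rather than re-sampling). split() returns a list of per-id data frames named by id, and indexing that list with a character vector that contains repeats returns the matching group once per occurrence:

```r
dat <- data.frame(var1 = c(10, 35, 13, 19, 15, 20, 19),
                  id   = c(1, 1, 2, 2, 2, 3, 4))
sampledIDs <- c(2, 4, 2)  # assumed: the draw shown in the question

# One data frame per id; the list names are the ids as character strings
groups <- split(dat, dat$id)
names(groups)
# [1] "1" "2" "3" "4"

# Indexing by name returns a group each time its id was sampled
picked <- groups[as.character(sampledIDs)]

# Stack the selected groups back into a single data frame
res <- do.call(rbind, picked)
res$id
# [1] 2 2 2 4 2 2 2
```

The conversion with as.character() matters: indexing the list with the numeric ids would select by position instead of by name, which only happens to agree when the ids are exactly 1, 2, 3, ...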

+3

flodel Nov 03 '14 at 11:28


One approach:

    dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
    #     var1 id
    # 3     13  2
    # 4     19  2
    # 5     15  2
    # 7     19  4
    # 3.1   13  2
    # 4.1   19  2
    # 5.1   15  2

Alternative approach:

 dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
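A short sketch of what the two expressions compute, again with sampledIDs hard-coded to the draw shown in the question (an assumption for reproducibility). Both build a vector of row indices in which a row number appears once for every time its id was sampled; the names idx1, idx2 and m are introduced here only for illustration:

```r
dat <- data.frame(var1 = c(10, 35, 13, 19, 15, 20, 19),
                  id   = c(1, 1, 2, 2, 2, 3, 4))
sampledIDs <- c(2, 4, 2)  # assumed: the draw shown in the question

# Approach 1: for each sampled id, look up the matching rows,
# then concatenate the lookups in sampling order
idx1 <- unlist(lapply(sampledIDs, function(x) which(x == dat$id)))
idx1
# [1] 3 4 5 7 3 4 5

# Approach 2: one logical column per sampled id (a 7 x 3 matrix here);
# which(..., arr.ind = TRUE) lists the TRUE cells column by column,
# and its first column holds the row numbers
m <- sapply(sampledIDs, "==", dat$id)
idx2 <- which(m, arr.ind = TRUE)[, 1]

all(idx1 == idx2)
# [1] TRUE
```

Indexing dat with either vector then repeats rows 3 to 5 for the second draw of id 2, which is why the output above shows row names such as 3.1.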

+3

Sven Hohenstein Nov 03 '14 at 11:30


David Arenburg · Accepted Answer · 2014-11-03T11:49:22+0000

For a large dataset, this would be the fastest approach, using data.table's binary search:

    library(data.table)
    setkey(setDT(dat), id)[J(sampledIDs)]
    #    var1 id
    # 1:   13  2
    # 2:   19  2
    # 3:   15  2
    # 4:   19  4
    # 5:   13  2
    # 6:   19  2
    # 7:   15  2
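Note that setDT() converts dat to a data.table in place and setkey() reorders it by id. As a hedged variant, not part of the original answer, the same join can be sketched on a copy using the on= argument, assuming a data.table version that supports it (1.9.6 or later). sampledIDs is hard-coded to the draw shown in the question, and dt and res are illustrative names:

```r
library(data.table)

dat <- data.frame(var1 = c(10, 35, 13, 19, 15, 20, 19),
                  id   = c(1, 1, 2, 2, 2, 3, 4))
sampledIDs <- c(2, 4, 2)  # assumed: the draw shown in the question

# as.data.table() makes a copy, so dat itself stays a plain data.frame
dt <- as.data.table(dat)

# Join dt to the sampled ids: every sampled id pulls in all of its rows,
# so an id that was drawn twice contributes its rows twice
res <- dt[.(sampledIDs), on = "id"]
res$var1
# [1] 13 19 15 19 13 19 15
```

Whether this variant is as fast as the keyed join on a given dataset is not something the benchmarks in this thread measure.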

Edit: Here is a benchmark on a not-so-large dataset (1e+05 rows) that shows which approach is the clear winner:

    library(data.table)
    library(microbenchmark)
    set.seed(123)
    n <- 1e5
    dat <- data.frame(var1 = sample(seq_len(100), n, replace = TRUE),
                      id = sample(seq_len(10), n, replace = TRUE))
    (sampledIDs <- sample(min(dat$id) : max(dat$id), size = 3, replace = TRUE))
    dat2 <- copy(dat)

    Sven1 <- function(dat) dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
    Sven2 <- function(dat) dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
    flodel <- function(dat) do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
    David <- function(dat2) setkey(setDT(dat2), id)[J(sampledIDs)]

    Res <- microbenchmark(Sven1(dat), Sven2(dat), flodel(dat), David(dat2))
    Res
    # Unit: milliseconds
    #         expr       min        lq    median        uq       max neval
    #   Sven1(dat)  4.356151  4.817557  6.715533  7.313877 45.407768   100
    #   Sven2(dat)  9.750984 12.385677 14.324671 16.655005 54.797096   100
    #  flodel(dat) 36.097602 39.680006 42.236017 44.314981 82.261879   100
    #  David(dat2)  1.813387  2.068749  2.154774  2.335442  8.665379   100

    boxplot(Res)

If, for example, we sample not just 3 IDs but, say, 10, the gap in the benchmark becomes ridiculous:

    (sampledIDs <- sample(min(dat$id) : max(dat$id), size = 10, replace = TRUE))
    # [1]  7  6 10  9  5  9  5  3  7  3

    # Unit: milliseconds
    #         expr        min         lq     median         uq       max neval
    #   Sven1(dat)  80.124502  89.141162  97.908365 104.111738 175.40919   100
    #   Sven2(dat)  99.010410 127.797966 159.404395 170.751069 209.96887   100
    #  flodel(dat) 129.722435 144.847505 157.737362 178.242103 232.41293   100
    #  David(dat2)   2.431682   2.721038   2.855103   3.057796  19.60826   100
