Fast subset in data.table in R

Given a data.table, I would like to pick out rows from it quickly. For instance:

 dt = data.table(a=1:10, key="a")
 dt[a > 3 & a <= 7]

This is pretty slow. I know I can do joins to get individual rows, but is there a way to exploit the fact that the data.table is sorted to get fast subsets of this type?

This is what I do:

 dt1 = data.table(id = 1, ym = c(199001, 199006, 199009, 199012),
                  last_ym = c(NA, 199001, 199006, 199009),
                  v = 1:4, key = c("id", "ym"))
 dt2 = data.table(id = 1,
                  ym = c(199001, 199002, 199003, 199004, 199005, 199006,
                         199007, 199008, 199009, 199010, 199011, 199012),
                  v2 = 1:12, key = c("id", "ym"))

dt1 holds one row per id and ym. For each such row I would like to sum the values of v2 in dt2 between the previous ym (last_ym) and the current ym. That is, for ym == 199006 in dt1 I would like to return list(v2 = 2 + 3 + 4 + 5 + 6): these are the v2 values in dt2 whose ym is less than or equal to the current ym and greater than the previous one. In code:

 expr = expression({
   # browser()
   cur_id = id
   cur_ym = ym
   cur_dtb = dt2[J(cur_id)][ym <= cur_ym & ym > last_ym]
   setkey(cur_dtb, ym)
   list(r = sum(cur_dtb$v2))
 })
 dt1[, eval(expr), by = list(id, ym)]
2 answers

To avoid the logical condition, do a rolling join of dt1 and dt2. Then shift ym forward one position within each id. Finally, sum v2 by id and ym:

 setkey(dt1, id, last_ym)
 setkey(dt2, id, ym)
 dt1[dt2, roll = TRUE][
     , list(v2 = v2, ym = c(last_ym[1], head(ym, -1))), by = id][
     , list(v2 = sum(v2)), by = list(id, ym)]

Note that we want to sum everything since the last ym, so the key on dt1 must be last_ym, not ym.
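In isolation, the shift c(last_ym[1], head(ym, -1)) just moves each ym back one position within the group, re-using the group's first last_ym for the leading slot. With the id == 1 values from dt1 (base R only):

```r
# ym and last_ym for id == 1, taken from dt1 in the question
ym      <- c(199001, 199006, 199009, 199012)
last_ym <- c(NA,     199001, 199006, 199009)

# Drop the last ym, prepend the group's first last_ym: every row now
# carries the *previous* period, which is exactly last_ym.
shifted <- c(last_ym[1], head(ym, -1))
identical(shifted, last_ym)   # TRUE
```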

Result:

    id     ym v2
 1:  1 199001  1
 2:  1 199006 20
 3:  1 199009 24
 4:  1 199012 33
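As a sanity check on the ym == 199006 row, the same 20 comes out of a direct vectorized subset over dt2's columns (base R, values from the question):

```r
ym <- 199001:199012   # dt2$ym for id == 1
v2 <- 1:12            # dt2$v2

# v2 values strictly after the previous period, up to the current one
sum(v2[ym > 199001 & ym <= 199006])   # 2+3+4+5+6 = 20
```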

UPDATE: correction


Regardless of how the data.table is sorted, you will be limited first and foremost by the time it takes to evaluate a > 3 & a <= 7:

 > dt = data.table(a=1:10000000, key="a")
 > system.time(dt$a > 3 & dt$a <= 7)
    user  system elapsed
    0.18    0.01    0.20
 > system.time(dt[, a > 3 & a <= 7])
    user  system elapsed
    0.18    0.05    0.24
 > system.time(dt[a > 3 & a <= 7])
    user  system elapsed
    0.25    0.07    0.31

Alternative approach:

 > system.time({Indices = dt$a > 3 & dt$a <= 7; dt[Indices]})
    user  system elapsed
    0.28    0.03    0.31
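When the range is narrow enough to enumerate its key values, a join on the sorted key avoids building the 10-million-element logical vector altogether: the rows are located by binary search. A sketch (assuming an integer key; no timings claimed):

```r
library(data.table)

dt <- data.table(a = 1:10000000, key = "a")

# J() builds a join table of candidate key values; the keyed join
# finds the matching rows by binary search instead of a full scan.
dt[J(4:7)]
```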

Multiple subsets

Speed can also become a problem if you take the subsets one factor level at a time instead of all at once:

 > dt <- data.table(A=sample(letters, 10000, replace=T))
 > system.time(for(i in unique(dt$A)) dt[A==i])
    user  system elapsed
    5.16    0.42    5.59
 > system.time(dt[, .SD, by=A])
    user  system elapsed
    0.32    0.03    0.36
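If the levels really must be processed one at a time, setting A as the key turns each lookup into a binary-search join rather than a full scan of the column — a sketch:

```r
library(data.table)
set.seed(1)

dt <- data.table(A = sample(letters, 10000, replace = TRUE))
setkey(dt, A)            # sort once up front

one_level <- dt[J("a")]  # binary search on the key, no vector scan
all(one_level$A == "a")  # TRUE
```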

Source: https://habr.com/ru/post/1489952/