I have a dataset indexed by two ID variables (one nested in the other) and a date, and I want to calculate rolling statistics over this data.
My real dataset is large (~200 million rows), and I've enjoyed data.table's speed boost on other tasks, but I can't figure out how to use data.table optimally here (i.e. exploit binary search and avoid a vector scan).
Sample data:
library(data.table)

set.seed(3)
dt1 <- data.table(id1 = c(rep("a", 124), rep("b", 124)),
                  id2 = c(rep("x", 62), rep("y", 62)),
                  date = seq(as.Date("2012-05-01"), as.Date("2012-07-01"), "days"),
                  var1 = rpois(124, 14),
                  var2 = rpois(124, 3))
setkey(dt1,id1,id2,date)
dt1 <- dt1[-c(5,10,36,46,58)]
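As background, subsetting on the leading key columns already uses binary search rather than a vector scan — it's the date window I can't express that way. A toy sketch on a separate throwaway table (`dt` and its columns here are illustrative, not part of the question's data):

```r
library(data.table)

# Toy illustration: with a key set, an equi-subset on the leading key
# columns uses binary search instead of scanning every row.
dt <- data.table(id1 = rep(c("a", "b"), each = 4),
                 id2 = rep(c("x", "y"), 4),
                 val = 1:8,
                 key = c("id1", "id2"))
sub <- dt[list("a", "x")]   # keyed lookup: binary search on (id1, id2)
```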
My ultimate goal is to calculate a "rolling statistic" for each day within each id1 / id2 group, equal to:

sum(var1) / sum(var2)

taken over all other rows with the same id1 / id2 combination and a date within the 30 days before that row.
For example, restricting to the single date 2012-06-12:
dt1[date < as.Date("2012-06-12") & date > as.Date("2012-06-12") - 31,
    list("newstat" = sum(var1) / sum(var2),
         "date" = as.Date("2012-06-12")),
    by = list(id1, id2)]
id1 id2 newstat date
1: a x 3.925 2012-06-12
2: a y 4.396 2012-06-12
3: b x 3.925 2012-06-12
4: b y 4.396 2012-06-12
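One correct but brute-force extension is to loop that query over every distinct date and stack the results. A minimal sketch (the function name `rolling_ref` is mine) — far too slow for ~200 million rows, but it pins down the target result:

```r
library(data.table)

# Reference (brute-force) implementation: run the single-date query once
# per distinct date and stack the pieces. Each pass re-scans the whole
# table, so this will not scale, but it defines the desired output.
rolling_ref <- function(dt) {
  rbindlist(lapply(sort(unique(dt$date)), function(d) {
    dt[date < d & date > d - 31,
       list(newstat = sum(var1) / sum(var2), date = d),
       by = list(id1, id2)]
  }))
}
```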
That gives the statistic I want for every id1 / id2 combination, but only for one date. I could loop it over all dates, but that seems like exactly the per-date vector scanning I'm trying to avoid. My best attempt so far builds every pairwise date combination with a self-join and then filters:
dt1[setkey(dt1[, list(id1, id2, "date_grp" = date)], id1, id2),
    list(date_grp, date, var1, var2)][
      date < date_grp & date > date_grp - 30,
      list("newstat" = sum(var1) / sum(var2)),
      by = list(id1, id2, date_grp)]
Which gives the desired result:
id1 id2 date_grp newstat
1: a x 2012-05-02 0.4286
2: a x 2012-05-03 0.4000
3: a x 2012-05-04 0.2857
4: a x 2012-05-06 0.2903
5: a x 2012-05-07 0.3056
---
235: b y 2012-06-27 0.2469
236: b y 2012-06-28 0.2354
237: b y 2012-06-29 0.2323
238: b y 2012-06-30 0.2426
239: b y 2012-07-01 0.2304
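For completeness, one direction that might avoid the cartesian intermediate entirely is a non-equi self-join (assuming data.table >= 1.9.8, which added non-equi `on` conditions): join each row's trailing window directly, so only in-window rows are ever materialized and grouped. This is a sketch, untested at 200-million-row scale; `rolling_join`, `win`, `lower`, and `upper` are names of my own, and I've used the 31-day bound from the single-date example above (note the attempt's `date_grp - 30` differs from it by one day):

```r
library(data.table)

# Sketch (assumes data.table >= 1.9.8 for non-equi joins): build one
# window row per original row, then join back, aggregating per i-row
# with by = .EACHI so only in-window rows are grouped and summed.
rolling_join <- function(dt) {
  win <- dt[, list(id1, id2, upper = date, lower = date - 31)]
  dt[win,
     on = .(id1, id2, date < upper, date > lower),
     list(date_grp = i.upper[1], newstat = sum(var1) / sum(var2)),
     by = .EACHI,
     nomatch = NULL][             # nomatch = NULL drops empty windows
       , list(id1, id2, date_grp, newstat)]
}
```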