How to calculate rolling statistics in R with data.table on unevenly spaced data

I have a dataset indexed by two ID variables (one nested in the other) and a date, and I want to calculate rolling statistics on this data.

My real data set is large (~200 million rows), and I have enjoyed the speed boost from data.table on other tasks, but I cannot figure out how to use data.table optimally (i.e. use binary search and avoid a vector scan) for this one.
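
To make the distinction concrete, here is a minimal sketch on a tiny hypothetical table (the names `dt`, `val`, `scanned` and `keyed` are illustrative and not part of the data below). The first subset is written as a logical filter, which is the vector-scan style; the second looks the same rows up through the key, which uses binary search. Recent data.table versions may optimize some logical filters automatically, so treat this as an illustration of the two forms rather than a benchmark.

```r
library(data.table)

# Tiny hypothetical table, keyed on the two ID columns
dt <- data.table(id1 = rep(c("a", "b"), each = 4),
                 id2 = rep(c("x", "y"), times = 4),
                 val = 1:8)
setkey(dt, id1, id2)

# Vector-scan style: a logical condition evaluated over the columns
scanned <- dt[id1 == "a" & id2 == "y"]

# Keyed lookup: binary search on the (id1, id2) key
keyed <- dt[.("a", "y")]
```

Both subsets return the same rows; only the lookup strategy differs.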

Sample data:

library(data.table)
set.seed(3)
dt1 <- 
 data.table(id1=c(rep("a",124),rep("b",124)),
            id2=c(rep("x",62),rep("y",62)),
            date=seq(as.Date("2012-05-01"),as.Date("2012-07-01"),"days"),
            var1=rpois(124,14),
            var2=rpois(124,3))
setkey(dt1,id1,id2,date)
# create uneven time spacing
dt1 <- dt1[-c(5,10,36,46,58)]

My ultimate goal is to calculate a "rolling statistic" for each date within id1/id2, defined as:

sum(var1) / sum(var2)

computed over all other rows with the same id1/id2 combination that fall within the 30 days before that row's date.

For example, for date = 2012-06-12:

dt1[date < as.Date("2012-06-12") & date > as.Date("2012-06-12")-31,
    list("newstat"=sum(var1)/sum(var2),
         "date"=as.Date("2012-06-12")),by=list(id1,id2)]

   id1 id2 newstat       date
1:   a   x   3.925 2012-06-12
2:   a   y   4.396 2012-06-12
3:   b   x   3.925 2012-06-12
4:   b   y   4.396 2012-06-12

My current attempt, which relies on a slow vector-scan subset:

dt1[setkey(dt1[,list(id1,id2,"date_grp"=date)],id1,id2),
    # keep the ID columns for grouping; the self-join is many-to-many
    list(id1,id2,date_grp,date,var1,var2),
    allow.cartesian=TRUE][
      # Here comes slow subset
      date<date_grp & date > date_grp-30,
      list("newstat"=sum(var1)/sum(var2)),
      by=list(id1,id2,date_grp)]

This gives:

     id1 id2   date_grp newstat
  1:   a   x 2012-05-02  0.4286
  2:   a   x 2012-05-03  0.4000
  3:   a   x 2012-05-04  0.2857
  4:   a   x 2012-05-06  0.2903
  5:   a   x 2012-05-07  0.3056
 ---                           
235:   b   y 2012-06-27  0.2469
236:   b   y 2012-06-28  0.2354
237:   b   y 2012-06-29  0.2323
238:   b   y 2012-06-30  0.2426
239:   b   y 2012-07-01  0.2304

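One option is to first build an intermediate table that lists, for every row, each of the 30 preceding dates, and then join on it: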

dt.dates <- dt1[, list(date.join=seq(as.Date(date - 1, origin="1970-01-01"), by="-1 day", len=30)), by=list(date, id1, id2)]
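
As a quick sanity check of the window that this `seq()` call produces, here is a base-R sketch for a single date (the names `d` and `window` are illustrative). It assumes the intended window is the 30 calendar days strictly before the row's date, which is what the question's filter `date < d & date > d - 31` selects.

```r
d <- as.Date("2012-06-12")

# The 30 calendar days immediately before d, newest first
window <- seq(d - 1, by = "-1 day", length.out = 30)

range(window)  # "2012-05-13" "2012-06-11"

# Every generated date satisfies the filter used in the question
all(window < d & window > d - 31)
```

So the generated dates run from 2012-05-13 to 2012-06-11, the same range the question's filter covers.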

Then key both tables on the date and the IDs, join, and aggregate:

setkey(dt.dates, date.join, id1, id2)
setkey(dt1,date,id1,id2)
dt.dates[dt1][ , sum(var1)/sum(var2), by=list(id1, id2, date)]

Checking the result for 6/12 against the direct filter from the question gives the same numbers:

> dt.dates[dt1][ , sum(var1)/sum(var2), by=list(id1, id2, date)][date=="2012-06-12"]
   id1 id2       date       V1
1:   a   x 2012-06-12 3.630631
2:   a   y 2012-06-12 4.434783
3:   b   x 2012-06-12 3.634783
4:   b   y 2012-06-12 4.434783
> dt1[date < as.Date("2012-06-12") & date > as.Date("2012-06-12")-31, list("newstat"=sum(var1)/sum(var2), "date"=as.Date("2012-06-12")),by=list(id1,id2)]
   id1 id2  newstat       date
1:   a   x 3.630631 2012-06-12
2:   a   y 4.434783 2012-06-12
3:   b   x 3.634783 2012-06-12
4:   b   y 4.434783 2012-06-12
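
A different technique from the one shown above, offered only as a sketch: data.table 1.9.8 and later support non-equi joins, which can express the window directly without materializing 30 rows per observation. The sample data and the names `dt`, `win`, `lo`, `hi` and `days` below are assumptions made for illustration, and the window is taken to be the 30 days strictly before each row's date.

```r
library(data.table)

# Hypothetical sample data with gaps in the dates
set.seed(1)
days <- seq(as.Date("2012-05-01"), as.Date("2012-07-01"), by = "day")
dt <- data.table(id1  = rep(c("a", "b"), each = 2 * length(days)),
                 id2  = rep(rep(c("x", "y"), each = length(days)), times = 2),
                 date = rep(days, times = 4))
dt[, `:=`(var1 = rpois(.N, 14), var2 = rpois(.N, 3))]
dt <- dt[-c(5, 10, 36, 46, 58)]   # create uneven time spacing

# One window per row: the 30 days strictly before that row's date
win <- dt[, .(id1, id2, date, lo = date - 30L, hi = date - 1L)]

# Non-equi join: for each window, aggregate the rows of dt that fall in it.
# by = .EACHI returns one result per row of win, in the same order.
win[, newstat := dt[win, on = .(id1, id2, date >= lo, date <= hi),
                    sum(var1) / sum(var2), by = .EACHI]$V1]
```

Rows whose window contains no earlier observations (the first date of each id1/id2 group) get `NA` for `newstat`.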


Source: https://habr.com/ru/post/1533438/
