Identifying events that fall within a time period of group-specific key dates

I have a data set of categorized events, each with two time points, t1 and t2, where t2 is always >= t1. Events are classified as z1, z2, z3 or NA (not classified). Within each group (grp), I want to identify events whose t1 falls within a certain period after the t2 of any categorized event in that group. In my example, I refer to these t2 values as key dates. The loop procedure below shows what I need, but it is hopelessly inefficient on a large dataset. I need this to run on a dataset containing several million rows and more than one million groups.

I have also shown my attempts to code this more efficiently using data.table syntax. My approach is to assign the set of key dates for each group to the vector "ref", then, for each row in the group, calculate the difference between t1 and the key dates, check whether any difference falls within a specific interval, and return a single logical value indicating whether that row's t1 is within 30 days of any of the key dates. When I restrict each group to a single key date (taking the first one with [1]), the code works, but when I allow several key dates, the code returns errors or results that are not what I need. Clearly I do not understand what data.table is doing in the j expression. Can someone explain what I am getting and suggest an efficient data.table solution?

Example data:

library("data.table")
DT <-read.table(text=
"grp,zcat,t1,t2
a,NA,2007-03-18,2007-03-28
a,z1,2007-08-04,2007-08-14
a,NA,2007-08-21,2007-08-23
a,NA,2007-11-21,2007-11-29
a,z1,2007-12-10,2007-12-13
a,z2,2008-02-16,2008-02-19
a,NA,2008-03-14,2008-03-21
a,NA,2008-05-27,2008-06-03
b,NA,2003-04-22,2003-04-27
b,z3,2003-05-11,2003-05-23
b,z1,2003-07-16,2003-08-02
c,z3,2011-01-18,2011-02-07
c,z3,2011-03-01,2011-03-13
c,NA,2011-03-30,2011-04-11
c,NA,2011-05-21,2011-05-28",
header=TRUE, sep=",", stringsAsFactors=FALSE, na.strings="NA", colClasses="character")
DT <-data.table(DT)
setorder(DT,grp,t1)

Key dates for each group:

grp-a: "2007-08-14" "2007-12-13" "2008-02-19"

grp-b: "2003-05-23" "2003-08-02"

grp-c: "2011-02-07" "2011-03-13"

Loop procedure (gives the result I need, but is slow):

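out <- logical(nrow(DT))
for (i in seq_len(nrow(DT))) {
    # key dates: t2 of every categorized event in the same group as row i
    ref <- DT[grp == grp[i] & !is.na(zcat), t2]
    # days from each key date to this row's t1
    temp <- as.Date(DT$t1[i]) - as.Date(ref)
    # TRUE if t1 falls 0 to 30 days after any key date
    out[i] <- any(temp >= 0 & temp < 31)
}
DT[, newvar := out]
rm(out, ref, temp)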

My data.table attempts. The first two use only the first key date in each group; the third uses all key dates:

DT[,{ref=t2[!is.na(zcat)]; delta=as.Date(t1) - as.Date(ref)[1]; delta >0 & delta <30}, by=grp]

DT[,{ref=t2[!is.na(zcat)][1]; delta=as.Date(t1) - as.Date(ref); delta >0 & delta <30}, by=grp]

DT[,{ref=t2[!is.na(zcat)]; delta=as.Date(t1) - as.Date(ref); any(delta >0 & delta <30)}, by=grp]
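
On the question of what j is doing: within each group, as.Date(t1) - as.Date(ref) is an element-wise subtraction. R pairs row 1 with key date 1, row 2 with key date 2, and recycles the shorter vector, so it never compares every row against every key date. Wrapping the result in any() then collapses each group to a single value instead of one value per row. A minimal sketch of a per-group fix, assuming each group is small enough that a rows-by-key-dates matrix fits in memory, builds all pairwise differences with outer(). The toy table is hypothetical, and the 0 to 30 day window is taken from the loop procedure.

```r
library(data.table)

# Hypothetical toy data: one group, four events, two key dates
toy <- data.table(
  grp  = "c",
  zcat = c("z3", "z3", NA, NA),
  t1   = as.Date(c("2011-01-18", "2011-03-01", "2011-03-30", "2011-05-21")),
  t2   = as.Date(c("2011-02-07", "2011-03-13", "2011-04-11", "2011-05-28"))
)

# What the third attempt computes before any(): four rows minus two key
# dates, with the key dates recycled (row 3 is paired with key date 1 again)
recycled <- toy[, as.integer(t1 - t2[!is.na(zcat)]), by = grp]

# All pairwise differences instead: one matrix row per event,
# one matrix column per key date
toy[, newvar := {
  ref <- as.integer(t2[!is.na(zcat)])     # key dates as day counts
  d   <- outer(as.integer(t1), ref, "-")  # days from each key date to each t1
  rowSums(d >= 0 & d < 31) > 0            # TRUE if any key date is in the window
}, by = grp]

print(toy)
```

A group with no key dates yields a matrix with zero columns, so every row in that group comes out FALSE. For very large groups the matrix can get expensive, which is one reason a join-based approach may scale better.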

Using a non-equi join (data.table 1.9.8+):

DT[, `:=`(t1 = as.Date(t1), t2 = as.Date(t2), newvar = FALSE)]

DT[DT[!is.na(zcat), .(grp, t2, t2.end = t2 + 31)],
   on = .(grp, t1 >= t2, t1 < t2.end),
   newvar := TRUE]
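
In the on clause, the left-hand side of each condition (t1) is a column of DT and the right-hand side (t2, t2.end) is a column of the key-date table, so a row is flagged when its t1 lies in the interval [t2, t2 + 31) of at least one key date in the same group. As a rough sanity check, the sketch below runs the same join on a hypothetical subset of the example rows and compares it with a row-by-row reference calculation. The names toy, keys and check are illustrative only.

```r
library(data.table)

# Hypothetical toy data: groups b and c from the example above
toy <- data.table(
  grp  = c("b", "b", "b", "c", "c", "c", "c"),
  zcat = c(NA, "z3", "z1", "z3", "z3", NA, NA),
  t1   = as.Date(c("2003-04-22", "2003-05-11", "2003-07-16",
                   "2011-01-18", "2011-03-01", "2011-03-30", "2011-05-21")),
  t2   = as.Date(c("2003-04-27", "2003-05-23", "2003-08-02",
                   "2011-02-07", "2011-03-13", "2011-04-11", "2011-05-28"))
)

# Non-equi update join: flag rows whose t1 falls in [t2, t2 + 31) of a key date
toy[, newvar := FALSE]
keys <- toy[!is.na(zcat), .(grp, t2, t2.end = t2 + 31)]
toy[keys, on = .(grp, t1 >= t2, t1 < t2.end), newvar := TRUE]

# Row-by-row reference, equivalent to the original loop
check <- vapply(seq_len(nrow(toy)), function(k) {
  ref <- toy$t2[toy$grp == toy$grp[k] & !is.na(toy$zcat)]
  d   <- as.integer(toy$t1[k] - ref)
  any(d >= 0 & d < 31)
}, logical(1))

print(identical(toy$newvar, check))
```

A row that matches several key dates is simply set to TRUE more than once, so duplicate matches do not affect the result.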

Source: https://habr.com/ru/post/1658107/

