Improve the search for the first occurrence of events

I work with large time series data.table, 60 * B * illion rows X 50 Columns

For three specific columns, I would like to add a corresponding T / F column denoting idCol first time each event occurs

In other words, for ColumnA, the new column will be

 DT[, flag.ColumnA := dateCol==min(dateCol) , by=list(idCol, ColumnA)] 

HOWEVER: Often there are bindings for min(dateCol) , and permission for relationships is that only one item will be marked TRUE , the remaining FALSE . This leads to the following approach

 ## Set key to {idCol, dateCol} so that the first row in each group ## is the unique element in that group that should be set to TRUE setkey(DT, idCol, dateCol) DT[, flag.ColumnA := FALSE] DT[, { DT[ .I[[1L]], flag.ColumnA := TRUE] } # braces here are just for easier reading , by=list(idCol, ColumnA)] 

The problem is that this second method increases execution time over 3x, while the first method already takes an hour per column (in a relatively fast box)

I also considered manually resolving the relationships in method 1, but it was slower than the two above methods.

Any suggestions on how to accomplish this task effectively? Examples of data below


Expected Result Example

 DT["ID_01"] [ColumnA %in% c("BT", "CK", "MH")] [order(ColumnA, dateCol)] idCol dateCol ColumnA ColumnB flag.ColumnA.M1 flag.ColumnA.M2 1: ID_01 2013-06-01 BT xxx TRUE TRUE <~~ M1 is WRONG, M2 is correct 2: ID_01 2013-06-01 BT www TRUE FALSE <~~ M1 is WRONG, M2 is correct 3: ID_01 2013-06-01 BT yyy TRUE FALSE <~~ M1 is WRONG, M2 is correct 4: ID_01 2013-06-22 BT xxx FALSE FALSE 5: ID_01 2013-11-23 BT yyy FALSE FALSE 6: ID_01 2013-11-30 BT zzz FALSE FALSE 7: ID_01 2013-06-15 CK www TRUE TRUE 8: ID_01 2013-06-15 CK uuu TRUE FALSE 9: ID_01 2013-06-15 CK www TRUE FALSE 10: ID_01 2013-06-29 CK zzz FALSE FALSE 11: ID_01 2013-10-12 CK vvv FALSE FALSE 12: ID_01 2013-11-02 CK uuu FALSE FALSE 13: ID_01 2013-06-22 MH uuu TRUE TRUE 14: ID_01 2013-06-22 MH xxx TRUE FALSE 15: ID_01 2013-06-22 MH zzz TRUE FALSE 16: ID_01 2013-08-24 MH ttt FALSE FALSE 17: ID_01 2013-09-07 MH xxx FALSE FALSE 18: ID_01 2013-09-14 MH zzz FALSE FALSE 19: ID_01 2013-09-21 MH vvv FALSE FALSE 20: ID_01 2013-11-30 MH ttt FALSE FALSE 

Data examples

 # increase N for realistic test N <- 2e4 # N should be large, as certain methods will be seemingly fast but wont scale ids <- sprintf("ID_%02d", seq(5)) A <- apply(expand.grid(LETTERS, LETTERS), 1, paste0, collapse="") B <- paste0(letters, letters, letters)[20:26] dates <- seq.Date(as.Date("2013-06-01"), as.Date("2013-12-01"), by=7) set.seed(1) DT <- data.table( dateCol=sample(dates, N, TRUE) , idCol =sample(ids, N, TRUE) , ColumnA=sample(A, N, TRUE) , ColumnB=sample(B, N, TRUE) , key="idCol") { cat("\n==========\nMETHOD ONE:\n") print(system.time({ DT[, flag.ColumnA.M1 := dateCol==min(dateCol) , by=list(idCol, ColumnA)]})) cat("\n\n==========\nMETHOD TWO:\n") print(system.time({ setkey(DT, idCol, dateCol) DT[, flag.ColumnA.M2 := FALSE] DT[, { DT[ .I[[1L]], flag.ColumnA.M2 := TRUE] } # braces here are just for easier reading , by=list(idCol, ColumnA)]})) } ## For Example, looking at ID_01, at a few select values of ColumnA: DT["ID_01"] [ColumnA %in% c("BT", "CK", "MH")] [order(ColumnA, dateCol)] 
+6
source share
2 answers

Just use which.min to resolve the links:

 DT[, flag := FALSE] DT[DT[, .I[which.min(dateCol)], by = list(idCol, ColumnA)]$V1, flag := TRUE] 

For your small sample of data, this is instant for me, as in the immeasurable system.time , and it is 1.5 times faster at N=1e7 than your method 1. I have not tested more N.

+4
source

I used set instead of := inside [.data.table . When set can create columns, it will be faster.

something like (binding by id and date to ensure ordering is correct)

 system.time({ set.seed(1) DT <- data.table( dateCol=sample(dates, N, TRUE) , idCol =sample(ids, N, TRUE) , ColumnA=sample(A, N, TRUE) , ColumnB=sample(B, N, TRUE) , key=c("idCol", "dateCol")) ll <- lapply(c('ColumnA','ColumnB'), function(cc) DT[,.I[1],by = c('idCol',cc)][['V1']]) flags <- c('flagA','flagB') DT[, (flags) := FALSE] jflag <- match(flags, names(DT), nomatch=0) for(jj in seq_along(jflag)){ set(DT, i = ll[[jj]], j = jflag[jj], value = TRUE) } }) # See this is lightening fast (even incorporating the creation of the data.table) ## user system elapsed ## 0.02 0.00 0.02 DT["ID_01"] [ColumnA %in% c("BT", "CK", "MH")] [order(ColumnA, dateCol)] idCol dateCol ColumnA ColumnB flagA flagB 1: ID_01 2013-06-01 BT xxx TRUE TRUE 2: ID_01 2013-06-01 BT www FALSE FALSE 3: ID_01 2013-06-01 BT yyy FALSE FALSE 4: ID_01 2013-06-22 BT xxx FALSE FALSE 5: ID_01 2013-11-23 BT yyy FALSE FALSE 6: ID_01 2013-11-30 BT zzz FALSE FALSE 7: ID_01 2013-06-15 CK www TRUE FALSE 8: ID_01 2013-06-15 CK uuu FALSE FALSE 9: ID_01 2013-06-15 CK www FALSE FALSE 10: ID_01 2013-06-29 CK zzz FALSE FALSE 11: ID_01 2013-10-12 CK vvv FALSE FALSE 12: ID_01 2013-11-02 CK uuu FALSE FALSE 13: ID_01 2013-06-22 MH uuu TRUE FALSE 14: ID_01 2013-06-22 MH xxx FALSE FALSE 15: ID_01 2013-06-22 MH zzz FALSE FALSE 16: ID_01 2013-08-24 MH ttt FALSE FALSE 17: ID_01 2013-09-07 MH xxx FALSE FALSE 18: ID_01 2013-09-14 MH zzz FALSE FALSE 19: ID_01 2013-09-21 MH vvv FALSE FALSE 20: ID_01 2013-11-30 MH ttt FALSE FALSE 

Other options include just one pass through

for example (it would be faster if you each time knew the column number of the added column (what you need in your situation, you won’t need to match flag columns for the data.table hten names)

 lapply(c('A','B'), function(LL){ cn <- sprintf('Column%s',LL) fl <- sprintf('flag%s',LL) DT[, (fl) :=FALSE] is <- DT[,.I[1],by =c('idCol',cn)][['V1']] jm <- match(fl, names(DT), nomatch=0) set(DT, i=is, j=jm, value=TRUE) invisible() }) 

or

 for(ff in seq_along(flags)){ is <- DT2[,.I[1], by =c('idCol',cols[1])][['V1']] set(DT2, i = is, j = jflag[ff], value = TRUE) } 
+3
source

Source: https://habr.com/ru/post/958426/


All Articles