I work with large time series data.table, 60 * B * illion rows X 50 Columns
For three specific columns, I would like to add a corresponding T / F column denoting idCol first time each event occurs
In other words, for ColumnA, the new column will be
DT[, flag.ColumnA := dateCol==min(dateCol) , by=list(idCol, ColumnA)]
HOWEVER: Often there are bindings for min(dateCol) , and permission for relationships is that only one item will be marked TRUE , the remaining FALSE . This leads to the following approach
#
The problem is that this second method increases execution time over 3x, while the first method already takes an hour per column (in a relatively fast box)
I also considered manually resolving the relationships in method 1, but it was slower than the two above methods.
Any suggestions on how to accomplish this task effectively? Examples of data below
Expected Result Example
DT["ID_01"] [ColumnA %in% c("BT", "CK", "MH")] [order(ColumnA, dateCol)] idCol dateCol ColumnA ColumnB flag.ColumnA.M1 flag.ColumnA.M2 1: ID_01 2013-06-01 BT xxx TRUE TRUE <~~ M1 is WRONG, M2 is correct 2: ID_01 2013-06-01 BT www TRUE FALSE <~~ M1 is WRONG, M2 is correct 3: ID_01 2013-06-01 BT yyy TRUE FALSE <~~ M1 is WRONG, M2 is correct 4: ID_01 2013-06-22 BT xxx FALSE FALSE 5: ID_01 2013-11-23 BT yyy FALSE FALSE 6: ID_01 2013-11-30 BT zzz FALSE FALSE 7: ID_01 2013-06-15 CK www TRUE TRUE 8: ID_01 2013-06-15 CK uuu TRUE FALSE 9: ID_01 2013-06-15 CK www TRUE FALSE 10: ID_01 2013-06-29 CK zzz FALSE FALSE 11: ID_01 2013-10-12 CK vvv FALSE FALSE 12: ID_01 2013-11-02 CK uuu FALSE FALSE 13: ID_01 2013-06-22 MH uuu TRUE TRUE 14: ID_01 2013-06-22 MH xxx TRUE FALSE 15: ID_01 2013-06-22 MH zzz TRUE FALSE 16: ID_01 2013-08-24 MH ttt FALSE FALSE 17: ID_01 2013-09-07 MH xxx FALSE FALSE 18: ID_01 2013-09-14 MH zzz FALSE FALSE 19: ID_01 2013-09-21 MH vvv FALSE FALSE 20: ID_01 2013-11-30 MH ttt FALSE FALSE
Data examples
# increase N for realistic test N <- 2e4 # N should be large, as certain methods will be seemingly fast but wont scale ids <- sprintf("ID_%02d", seq(5)) A <- apply(expand.grid(LETTERS, LETTERS), 1, paste0, collapse="") B <- paste0(letters, letters, letters)[20:26] dates <- seq.Date(as.Date("2013-06-01"), as.Date("2013-12-01"), by=7) set.seed(1) DT <- data.table( dateCol=sample(dates, N, TRUE) , idCol =sample(ids, N, TRUE) , ColumnA=sample(A, N, TRUE) , ColumnB=sample(B, N, TRUE) , key="idCol") { cat("\n==========\nMETHOD ONE:\n") print(system.time({ DT[, flag.ColumnA.M1 := dateCol==min(dateCol) , by=list(idCol, ColumnA)]})) cat("\n\n==========\nMETHOD TWO:\n") print(system.time({ setkey(DT, idCol, dateCol) DT[, flag.ColumnA.M2 := FALSE] DT[, { DT[ .I[[1L]], flag.ColumnA.M2 := TRUE] } # braces here are just for easier reading , by=list(idCol, ColumnA)]})) } ## For Example, looking at ID_01, at a few select values of ColumnA: DT["ID_01"] [ColumnA %in% c("BT", "CK", "MH")] [order(ColumnA, dateCol)]