I have a data.table and I'm trying to do something similar to data[ !is.na(variable) ]. However, for groups where the variable is missing in every row, I would like to keep just the first row of the group. So I'm trying to subset by group. I did some research online and found a solution, but I believe it is inefficient.
The example below shows what I hope to achieve, and I wonder whether it can be done without creating two additional columns.
library(data.table)

d_sample = data.table(
  ID = c(1, 1, 2, 2, 3, 3),
  Time = c(10, 15, 100, 110, 200, 220),
  Event = c(NA, NA, NA, 1, 1, NA))

# Flag rows with a non-missing Event
d_sample[ !is.na(Event), isValidOutcomeRow := T, by = ID]
# Flag IDs that have at least one such row (stays NA otherwise)
d_sample[ , isValidOutcomePatient := any(isValidOutcomeRow), by = ID]
# For IDs with no valid row, mark only the first row
d_sample[ is.na(isValidOutcomePatient), isValidOutcomeRow := c(T, rep(NA, .N - 1)), by = ID]
d_sample[ isValidOutcomeRow == T ]
EDIT: Below are some speed comparisons of thelatemail's and Frank's solutions on a larger dataset with 60K rows.
d_sample = data.table(
  ID = sort(rep(seq(1,30000), 2)),
  Time = rep(c(10, 15, 100, 110, 200, 220), 10000),
  Event = rep(c(NA, NA, NA, 1, 1, NA), 10000) )
thelatemail's solution gets a runtime of 20.65 seconds on my computer.
system.time(d_sample[, if(all(is.na(Event))) .SD[1] else .SD[!is.na(Event)][1], by=ID])
Frank's first solution gets a runtime of 0
system.time( unique( d_sample[order(is.na(Event))], by="ID" ) )
Frank's second solution gets a runtime of 0.05
system.time( d_sample[order(is.na(Event)), .SD[1L], by=ID] )
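One thing to note: the three timed variants above return a single row per ID, whereas data[ !is.na(variable) ] would keep every non-NA row. Below is a minimal sketch of one way to keep all non-NA rows (plus the first row of all-NA groups) without adding helper columns, by collecting row numbers with .I. The name keep_idx is only illustrative, and I have not timed this on the 60K-row data, so treat it as a starting point rather than a benchmarked solution.

```r
library(data.table)

# Small example from the question
d_sample <- data.table(
  ID    = c(1, 1, 2, 2, 3, 3),
  Time  = c(10, 15, 100, 110, 200, 220),
  Event = c(NA, NA, NA, 1, 1, NA)
)

# Per ID, collect the row numbers to keep:
#   - the first row if every Event in the group is NA
#   - otherwise every row with a non-NA Event
# keep_idx is a hypothetical helper name, not part of data.table
keep_idx <- d_sample[
  , if (all(is.na(Event))) .I[1L] else .I[!is.na(Event)],
  by = ID
]$V1

# Subset the original table once, using the collected row numbers
result <- d_sample[keep_idx]
print(result)
```

Because the grouped step only returns integer row numbers and the table is subset once at the end, this avoids building a .SD subset per group, which is usually the slow part of the .SD[1] approach.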