I encountered this problem twice in the last two weeks, so I decided it was worth it. I am trying to define "runs" inside data.table , but I cannot find an elegant way to do this.
Example
set.seed(2016) dt <- data.table(ID = 1:50, Char = sample(LETTERS, 50, replace=TRUE)) dt <- dt[order(Char, ID)] ID Char 1: 9 A 2: 10 B 3: 20 C 4: 42 C 5: 2 D 6: 4 D 7: 6 D 8: 18 D ...
Here I would like to identify and group the lines where the identifier is within 2 lines above / below. Here is my real ugly decision
# Runs of 2 or more IDs within 2 of each other dt[, `:=`(InRun = FALSE, InRunStart = FALSE)] dt[abs(ID - shift(ID, type="lag")) <= 2 | abs(shift(ID, type="lead") - ID) <= 2, InRun := TRUE] dt[InRun == TRUE & abs(ID - shift(ID, type="lag")) > 2 | is.na(shift(ID, type="lag")), InRunStart := TRUE] dt[InRun == TRUE, RunID := cumsum(InRunStart)] dt[, c("InRun", "InRunStart") := NULL] dt ID Char RunID 1: 9 A 1 2: 10 B 1 3: 20 C NA 4: 42 C NA 5: 2 D 2 6: 4 D 2 7: 6 D 2 8: 18 D NA ...
Is there a better way to do this?
EDIT: There seems to be some confusion about how I define "start". To do this more explicitly, row_i and row_i + 1 must have the same RunID if and only if their identifiers are at a distance of 2.
source share