An elegant way to define runs inside data.table

Question

I have run into this problem twice in the last two weeks, so I decided it was worth asking about. I am trying to define "runs" inside a data.table, but I cannot find an elegant way to do it.

Example

    library(data.table)

    set.seed(2016)
    dt <- data.table(ID = 1:50, Char = sample(LETTERS, 50, replace=TRUE))
    dt <- dt[order(Char, ID)]

        ID Char
     1:  9    A
     2: 10    B
     3: 20    C
     4: 42    C
     5:  2    D
     6:  4    D
     7:  6    D
     8: 18    D
    ...

Here I would like to identify and group the rows whose ID is within 2 of the ID in the row above or below. Here is my really ugly solution:

    # Runs of 2 or more IDs within 2 of each other
    dt[, `:=`(InRun = FALSE, InRunStart = FALSE)]
    dt[abs(ID - shift(ID, type="lag")) <= 2 | abs(shift(ID, type="lead") - ID) <= 2,
       InRun := TRUE]
    dt[InRun == TRUE & abs(ID - shift(ID, type="lag")) > 2 | is.na(shift(ID, type="lag")),
       InRunStart := TRUE]
    dt[InRun == TRUE, RunID := cumsum(InRunStart)]
    dt[, c("InRun", "InRunStart") := NULL]
    dt

        ID Char RunID
     1:  9    A     1
     2: 10    B     1
     3: 20    C    NA
     4: 42    C    NA
     5:  2    D     2
     6:  4    D     2
     7:  6    D     2
     8: 18    D    NA
    ...

Is there a better way to do this?

EDIT: There seems to be some confusion about how I define a "run". To be more explicit: row_i and row_i+1 must have the same RunID if and only if their IDs are within 2 of each other.
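To make that definition concrete, here is a minimal base R sketch using the IDs of the four "D" rows from the example table. The vector names `ids` and `same_run` are illustrative only and are not part of the code above.

```r
# IDs of rows 5 to 8 ("D") from the example table
ids <- c(2, 4, 6, 18)

# One flag per adjacent pair: TRUE where row i and row i + 1 should share a RunID
same_run <- abs(diff(ids)) <= 2
same_run
# TRUE TRUE FALSE: 2-4 and 4-6 are linked, 6-18 is not
```

So rows 5 to 7 form one run, and row 8 is left on its own.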

+5

r data.table

Ben Nov 15 '16 at 1:06

2 answers

I don’t know whether it is elegant or not, but what about:

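    dt <- data.table(ID = c(9, 10, 15, 18, 21, 22, 25))

    # TRUE where two consecutive IDs are within 2 of each other
    run_ids <- abs(dt[1:(.N-1), ID] - dt[2:.N, ID]) <= 2
    # Pad at the front so there is one flag per row
    run_ids <- c(run_ids[1], run_ids)

    # Number each stretch of TRUE flags; FALSE stretches become 0
    foo <- with(rle(run_ids), rep(cumsum(values) * values, lengths))
    # A row flagged 0 takes the label of the row after it,
    # which pulls the first row of each run into that run
    foo[foo == 0] = foo[which(foo == 0) + 1]

    dt[, RunID := foo]
    dt[RunID == 0, RunID := NA]

    #    ID RunID
    # 1:  9     1
    # 2: 10     1
    # 3: 15    NA
    # 4: 18    NA
    # 5: 21     2
    # 6: 22     2
    # 7: 25    NA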
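The `rle` step is the least obvious part, so here is a small base R sketch of what it does on the padded flags that the IDs above produce. The names `r` and `labels` are mine, not from the answer.

```r
# Padded "within 2" flags for IDs 9, 10, 15, 18, 21, 22, 25
run_ids <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)

r <- rle(run_ids)
r$values   # TRUE FALSE TRUE FALSE
r$lengths  # 2 3 1 1

# cumsum(values) numbers the TRUE stretches; multiplying by values zeroes the FALSE ones
labels <- with(r, rep(cumsum(values) * values, lengths))
labels     # 1 1 0 0 0 2 0
```

The zero at position 5 sits directly before a labelled stretch, which is why the answer then copies the following label back onto it.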

+1

John smith Nov 15 '16 at 18:59

Frank · Accepted Answer · 2016-11-15T19:10:27+0000

I would stop after creating this run ID:

    dt[, run_id0 := 1L + cumsum(abs(ID - shift(ID, fill=ID[1L])) > 2)]
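As a rough illustration of how that one-liner behaves, here is a sketch on the first eight IDs from the question. The table name `toy` and the helper column `brk` are assumptions added for readability.

```r
library(data.table)

# First eight IDs from the question's example (toy data, not the full table)
toy <- data.table(ID = c(9, 10, 20, 42, 2, 4, 6, 18))

# A break occurs wherever the gap to the previous ID exceeds 2;
# fill = ID[1L] makes the first difference 0, so row 1 is never a break
toy[, brk := abs(ID - shift(ID, fill = ID[1L])) > 2]

# Each break starts a new run, so the running count of breaks is the run ID
toy[, run_id0 := 1L + cumsum(brk)]
toy$run_id0
# 1 1 2 3 4 4 4 5
```

Every row gets an ID here, including rows that form a run of length one.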

But to get the OP's run ID (which ignores runs of length one), here are a couple of ways:

    dt[duplicated(run_id0) | duplicated(run_id0, fromLast=TRUE),
       run_id1 := .GRP, by=run_id0]

    # or

    dt[, run_len := .N, by=run_id0][
       run_len > 1L, run_id2 := .GRP, by=run_id0]
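A quick check, sketched on the first eight IDs from the question, suggests that both variants give the same result and reproduce the RunID column the OP showed. The table name `toy` is hypothetical.

```r
library(data.table)

# First eight IDs from the question's example (toy data, not the full table)
toy <- data.table(ID = c(9, 10, 20, 42, 2, 4, 6, 18))
toy[, run_id0 := 1L + cumsum(abs(ID - shift(ID, fill = ID[1L])) > 2)]

# Variant 1: keep only rows whose run_id0 value occurs more than once
toy[duplicated(run_id0) | duplicated(run_id0, fromLast = TRUE),
    run_id1 := .GRP, by = run_id0]

# Variant 2: store the group size first, then number the groups of size > 1
toy[, run_len := .N, by = run_id0][
  run_len > 1L, run_id2 := .GRP, by = run_id0]

toy[, .(ID, run_id1, run_id2)]
# both ID columns come out as 1 1 NA NA 2 2 2 NA
```

Because `.GRP` counts groups only among the rows selected in `i`, the surviving runs are numbered consecutively from 1.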
