R conditional grouping of lines and numbering groups

Question

I work with data frames of flight movements (~1 million rows * 108 variables) and want to group phases during which a certain criterion is fulfilled (i.e. the value of a certain variable). To identify these groups, I want to number them. As a newbie to R, I have got this working for my purposes, but I am now looking for a more elegant way. In particular, I would like to close the “useless” gaps in the group numbering. Below is a simplified example of my dplyr data frame, with a THR value for a threshold criterion. Rows are sorted by timestamp (so I have omitted the timestamp here).

library(dplyr)

THR <- c(13,17,19,22,21,19,17,12,12,17,20,20,20,17,17,13,20,20,17,13)
df  <- as.data.frame(THR)
df  <- tbl_df(df)  # deprecated in current dplyr; as_tibble(df) is the equivalent

To mark all rows where the criterion is (not) met:

df  <- mutate(df, CRIT = THR < 19)

With the following conditional cumsum, I was able to get a unique group identification:

df <- mutate(df, GRP = ifelse(CRIT == 1, 0, cumsum(CRIT)))
df
  THR CRIT GRP
1  13 TRUE   0
2  17 TRUE   0
3  19 FALSE  2          
4  22 FALSE  2
5  21 FALSE  2
6  19 FALSE  2
7  17 TRUE   0
8  12 TRUE   0
9  12 TRUE   0
10 17 TRUE   0
11 20 FALSE  6
12 20 FALSE  6

Although this trick works and lets me handle the groups with group_by (for example, to summarize or filter), the numbering is not ideal, as the example output shows: the first group is numbered 2 and the second group is numbered 6, which are simply the values of cumsum() at those rows.

I would appreciate it if someone could shed some light on this. I could not find a suitable solution in other posts.

+4

r grouping

Rainer Sep 7 '15 at 10:55

2 answers

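Redefine CRIT as a cumsum, then run cumsum/diff only on the rows that meet the threshold, so that all other rows stay NA. Here it is with data.table (the df <- tbl_df(df) step is not needed for this):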

library(data.table)
setDT(df)[, CRIT := cumsum(THR < 19)]
df[THR >= 19, GRP := cumsum(c(0L, diff(CRIT)) != 0L) + 1L]
#     THR CRIT GRP
#  1:  13    1  NA
#  2:  17    2  NA
#  3:  19    2   1
#  4:  22    2   1
#  5:  21    2   1
#  6:  19    2   1
#  7:  17    3  NA
#  8:  12    4  NA
#  9:  12    5  NA
# 10:  17    6  NA
# 11:  20    6   2
# 12:  20    6   2
# 13:  20    6   2
# 14:  17    7  NA
# 15:  17    8  NA
# 16:  13    9  NA
# 17:  20    9   3
# 18:  20    9   3
# 19:  17   10  NA
# 20:  13   11  NA
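
The same numbering can also be sketched in base R by flagging the first row of each run directly. This is only an illustration of the idea, not part of the answer above; the names above and start are made up for the example, and the threshold of 19 is taken from the question.

```r
THR <- c(13, 17, 19, 22, 21, 19, 17, 12, 12, 17, 20, 20, 20, 17, 17, 13, 20, 20, 17, 13)

# TRUE for rows at or above the threshold (the rows that should be grouped)
above <- THR >= 19

# TRUE only on the first row of each run: the row is above the threshold
# and the previous row was not (the first row has no predecessor, hence FALSE)
start <- above & !c(FALSE, head(above, -1))

# Counting the run starts gives consecutive group numbers; other rows stay NA
GRP <- ifelse(above, cumsum(start), NA_integer_)
```

Because only run starts are counted, the group numbers are consecutive regardless of how many rows lie between two runs.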

+1

David Arenburg Sep 7 '15 at 11:45

Colonel beauvel · Accepted Answer · 2015-09-07T11:36:20+0000

You can do:

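 x = rle(df$CRIT)                 # runs of consecutive TRUE/FALSE values
 mask = x$values                  # TRUE for runs below the threshold
 x$values[mask] = 0               # those runs get 0 (coerces values to numeric)
 x$values[!mask] = cumsum(!x$values[!mask])  # remaining runs are numbered 1, 2, 3, ...

 mutate(df, GRP=inverse.rle(x))   # expand the run values back to one per row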

#   THR  CRIT GRP
#1   13  TRUE   0
#2   17  TRUE   0
#3   19 FALSE   1
#4   22 FALSE   1
#5   21 FALSE   1
#6   19 FALSE   1
#7   17  TRUE   0
#8   12  TRUE   0
#9   12  TRUE   0
#10  17  TRUE   0
#11  20 FALSE   2
#12  20 FALSE   2
#13  20 FALSE   2
#14  17  TRUE   0
#15  17  TRUE   0
#16  13  TRUE   0
#17  20 FALSE   3
#18  20 FALSE   3
#19  17  TRUE   0
#20  13  TRUE   0
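
As a usage sketch, the consecutive GRP values can then feed group_by and summarise, as mentioned in the question. The column names n_rows and max_THR and the object name phases are hypothetical, chosen only for this example; the rle step is a compact variant of the code above, not a replacement for it.

```r
library(dplyr)

THR <- c(13, 17, 19, 22, 21, 19, 17, 12, 12, 17, 20, 20, 20, 17, 17, 13, 20, 20, 17, 13)
df <- data.frame(THR) %>% mutate(CRIT = THR < 19)

# Number the runs: 0 where CRIT is TRUE, 1, 2, 3, ... for the other runs
x <- rle(df$CRIT)
x$values <- ifelse(x$values, 0L, cumsum(!x$values))
df <- mutate(df, GRP = inverse.rle(x))

# Drop the rows below the threshold and summarise each remaining phase
phases <- df %>%
  filter(GRP > 0) %>%
  group_by(GRP) %>%
  summarise(n_rows = n(), max_THR = max(THR))
```

For the example data this yields one row per phase, so the three phases at or above the threshold can be inspected or filtered individually.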
