Subset R data frame by factor level, deleting ALL rows after the threshold is crossed the FIRST time

I have some web session data from which I am trying to exclude ALL observations from the point where some amount of time (say 10 days) has passed since the previous visit. I have an ID, a VisitNum, and a calculated DateDiff representing the days that have passed since the last visit. My identifier is a factor, so I need a solution that works across many factor levels.

Sample data:

test_data <- data.frame(ID=c("abc123","abc123","abc123","abc123"),
                    VisitNum=c(1,2,3,4),
                    DateDiff=c(0,5,30,5))

Since the 3rd visit occurred 30 days after the second visit, I want to exclude BOTH the 3rd and 4th visits from the data frame. The solutions I came up with exclude the 3rd visit but leave the 4th, which I don't need.

My desired result would look like this:

test_results <- data.frame(ID=c("abc123","abc123"),
                       VisitNum=c(1,2),
                       DateDiff=c(0,5))

Thanks!


Here is a base R solution using cummin:

test_data[as.logical(cummin(test_data$DateDiff < 10)), ]
      ID VisitNum DateDiff
1 abc123        1        0
2 abc123        2        5

To handle multiple IDs, use ave:

test_data[as.logical(ave(test_data$DateDiff, test_data$ID,
                         FUN=function(i) cummin(i < 10))), ]
      ID VisitNum DateDiff
1 abc123        1        0
2 abc123        2        5
6 abc323        2        5
7 abc323        3        5

as.logical is needed because cummin and ave return numeric values rather than logicals.
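
To see why as.logical matters here: cummin on a logical vector gives back integers, and integer indexing means something different from logical indexing. A small sketch using the DateDiff values from the question:

```r
date_diff <- c(0, 5, 30, 5)

# cummin coerces the logical vector to integer: once a FALSE (0) appears,
# the running minimum stays at 0 for the rest of the vector
flags <- cummin(date_diff < 10)
print(flags)         # 1 1 0 0
print(class(flags))  # "integer"

# Used directly as an index, c(1, 1, 0, 0) means "row 1, row 1" (zeros are
# dropped), so convert back to logical to get a keep/drop mask
keep <- as.logical(flags)
print(keep)          # TRUE TRUE FALSE FALSE
```

That is also why everything after the first gap is dropped: the running minimum can never climb back to 1 once the condition has failed.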


Here is a data.table solution:

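library(data.table)
setDT(test_data)  # convert test_data to a data.table by reference
# Per-ID running minimum of the condition; assumes each ID's rows are contiguous
test_data[as.logical(test_data[, cummin(DateDiff < 10), by = ID]$V1)]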
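
The version above lines the grouped result up against the original row order, which works when each ID's rows sit together. If IDs might be interleaved, a sketch using data.table's .I (the row numbers available inside a grouped call) avoids that assumption. The toy table and the names dt, keep_idx and result are made up for illustration:

```r
library(data.table)

# Toy data with interleaved IDs (hypothetical, not from the question)
dt <- data.table(
  ID       = c("a", "b", "a", "b", "a"),
  VisitNum = c(1, 1, 2, 2, 3),
  DateDiff = c(0, 0, 30, 5, 5)
)

# For each ID, collect the original row numbers that come before the first gap
keep_idx <- dt[, .I[as.logical(cummin(DateDiff < 10))], by = ID]$V1

# Sort so the result keeps the original row order
result <- dt[sort(keep_idx)]
print(result)
```

Because the subset is done by row number rather than by position in the grouped output, the order of the IDs in the table no longer matters.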

Data used:

test_data <- 
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 2L, 3L, 3L),
.Label = c("abc123", "abc223", "abc323"), class = "factor"),
VisitNum = c(1, 2, 3, 4, 2, 2, 3), DateDiff = c(0, 5, 30, 5, 20, 5, 5)),
.Names = c("ID", "VisitNum", "DateDiff"), row.names = c(NA, -7L),
class = "data.frame")

You could also use which:

test_data[1:(which(test_data$DateDiff > 10)[1] - 1),]
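
The which one-liner above assumes at least one row crosses the threshold and that the first row does not. A minimal sketch of a guarded, per-ID variant, assuming the same cutoff as the cummin answers (keep rows while DateDiff < 10); keep_before_gap and its threshold argument are hypothetical names for illustration:

```r
test_data <- data.frame(
  ID = factor(c("abc123", "abc123", "abc123", "abc123",
                "abc223", "abc323", "abc323")),
  VisitNum = c(1, 2, 3, 4, 2, 2, 3),
  DateDiff = c(0, 5, 30, 5, 20, 5, 5)
)

# Hypothetical helper: keep the rows before the first DateDiff >= threshold
keep_before_gap <- function(dat, threshold = 10) {
  first_over <- which(dat$DateDiff >= threshold)[1]
  if (is.na(first_over)) return(dat)            # threshold never crossed: keep all
  dat[seq_len(first_over - 1), , drop = FALSE]  # seq_len(0) is empty, unlike 1:0
}

# Apply per ID, then stitch the pieces back together
pieces <- lapply(split(test_data, test_data$ID), keep_before_gap)
result <- do.call(rbind, pieces)
rownames(result) <- NULL
print(result)
```

On the 7-row data set this keeps visits 1 and 2 for abc123, nothing for abc223, and both rows for abc323, matching the ave result.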

Benchmark:

test_data <- data.frame(ID=sample(c("abc123","abc123","abc123","abc123"),2000,T),
                        VisitNum=1:2000,
                        DateDiff=sample(c(0,5,30,5),2000,T))

library(microbenchmark)

# a: which-based subset; b: cummin-based subset
a <- function(dat) dat[1:(which(dat$DateDiff > 10)[1] - 1),]
b <- function(dat) dat[as.logical(cummin(dat$DateDiff < 10)), ]
microbenchmark(a(test_data), b(test_data), times = 1000)

## Unit: microseconds
##          expr     min       lq     mean  median      uq        max neval cld
##  a(test_data) 141.198 146.1895 197.6538 151.507 167.880   2326.238  1000  a 
##  b(test_data) 196.443 201.4810 496.1748 209.448 235.708 137785.448  1000   b

By median, b is ~38% slower over 1000 runs on 2000 rows.


Source: https://habr.com/ru/post/1670839/

