How to restart a count when a condition is met

I have found options related to this question, and I know that modulo could be used, but I'm having a hard time putting it all together.

I have a sequence of observations by ID and seconds. Whenever the cumulative number of seconds within an ID increases by 5 seconds or more, I would like to restart the count. Can someone help me do this in dplyr?

Original df

df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
                 val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))
df
   id val
1   1   2
2   1  10
3   1  12
4   1  15
5   1  17
6   2   2
7   2   4
8   2   7
9   2   8
10  3  12
11  3  15
12  3  20
13  3  25

Desired Result

finalResult
   id val reset
1   1   2     1
2   1  10     2
3   1  12     2
4   1  15     3
5   1  17     3
6   2   2     1
7   2   4     1
8   2   7     2
9   2   8     2
10  3  12     1
11  3  15     1
12  3  20     2
13  3  25     3

Edit

Thanks for the answers yesterday, but I ran into some problems with these solutions.

On the data set below, the code only works in some cases.

sub.df <- structure(list(`ID` = c("1", "1", "1", "1", "1", "1", "1", "1", "1"),
                         dateFormat = structure(c(1479955726, 1479955726, 1483703713,
                                                  1495190809, 1495190809, 1497265079,
                                                  1497265079, 1474023059, 1474023061),
                                                class = c("POSIXct", "POSIXt"),
                                                tzone = "America/Chicago")),
                    .Names = c("ID", "dateFormat"), row.names = c(NA, -9L),
                    class = c("tbl_df", "tbl", "data.frame"))

Solution Used:

library(tidyverse)  # dplyr verbs and purrr::accumulate are used below

jj <- sub.df %>%
  group_by(`ID`) %>%
  arrange(`ID`, `dateFormat`) %>%
  mutate(totalTimeInt = difftime(dateFormat, first(dateFormat), units = 'secs')) %>%
  mutate(totalTimeFormat = as.numeric(totalTimeInt)) %>%
  mutate(reset = cumsum(
    Reduce(
      function(x, y) if (x + y >= 5) 0 else x + y,
      diff(totalTimeFormat),
      init = 0,
      accumulate = TRUE
    ) == 0
  )) %>%
  mutate(reset_2 = cumsum(
    accumulate(
      diff(totalTimeFormat),
      ~if (.x + .y >= 5) 0 else .x + .y,
      .init = 0
    ) == 0
  ))

Result

# A tibble: 9 x 6
# Groups:   ID [1]
     ID          dateFormat totalTimeInt totalTimeFormat reset reset_2
  <chr>              <dttm>       <time>           <dbl> <int>   <int>
1     1 2016-09-16 05:50:59       0 secs               0     1       1
2     1 2016-09-16 05:51:01       2 secs               2     1       1
3     1 2016-11-23 20:48:46 5932667 secs         5932667     2       2
4     1 2016-11-23 20:48:46 5932667 secs         5932667     3       3
5     1 2017-01-06 05:55:13 9680654 secs         9680654     4       4
6     1 2017-05-19 05:46:49 21167750 secs       21167750     5       5
7     1 2017-05-19 05:46:49 21167750 secs       21167750     6       6
8     1 2017-06-12 05:57:59 23242020 secs       23242020     7       7
9     1 2017-06-12 05:57:59 23242020 secs       23242020     8       8

What happens is that the first two observations are correctly counted as one instance. But the third and fourth observations should both be counted as instance 2, since essentially no time elapsed between them.

Correct output:

# A tibble: 9 x 6
# Groups:   ID [1]
     ID          dateFormat totalTimeInt totalTimeFormat reset reset_2
  <chr>              <dttm>       <time>           <dbl> <int>   <int>
1     1 2016-09-16 05:50:59       0 secs               0     1       1
2     1 2016-09-16 05:51:01       2 secs               2     1       1
3     1 2016-11-23 20:48:46 5932667 secs         5932667     2       2
4     1 2016-11-23 20:48:46 5932667 secs         5932667     2       2
5     1 2017-01-06 05:55:13 9680654 secs         9680654     3       3
6     1 2017-05-19 05:46:49 21167750 secs       21167750     4       4
7     1 2017-05-19 05:46:49 21167750 secs       21167750     4       4
8     1 2017-06-12 05:57:59 23242020 secs       23242020     5       5
9     1 2017-06-12 05:57:59 23242020 secs       23242020     5       5
2 answers

If you use Reduce with accumulate = TRUE (or purrr::accumulate, if you prefer), you can reset the running difference whenever it reaches 5 or more. Calling cumsum on whether the running value is 0 then gives the number of resets.

library(tidyverse)

df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
                 val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))

df %>%
  group_by(id) %>%
  mutate(reset = cumsum(
    Reduce(
      function(x, y) if (x + y >= 5) 0 else x + y,
      diff(val),
      init = 0,
      accumulate = TRUE
    ) == 0
  ))
#> # A tibble: 13 x 3
#> # Groups:   id [3]
#>       id   val reset
#>    <dbl> <dbl> <int>
#>  1     1     2     1
#>  2     1    10     2
#>  3     1    12     2
#>  4     1    15     3
#>  5     1    17     3
#>  6     2     2     1
#>  7     2     4     1
#>  8     2     7     2
#>  9     2     8     2
#> 10     3    12     1
#> 11     3    15     1
#> 12     3    20     2
#> 13     3    25     3

or, using purrr::accumulate:

df %>%
  group_by(id) %>%
  mutate(reset = cumsum(
    accumulate(
      diff(val),
      ~if (.x + .y >= 5) 0 else .x + .y,
      .init = 0
    ) == 0
  ))
#> # A tibble: 13 x 3
#> # Groups:   id [3]
#>       id   val reset
#>    <dbl> <dbl> <int>
#>  1     1     2     1
#>  2     1    10     2
#>  3     1    12     2
#>  4     1    15     3
#>  5     1    17     3
#>  6     2     2     1
#>  7     2     4     1
#>  8     2     7     2
#>  9     2     8     2
#> 10     3    12     1
#> 11     3    15     1
#> 12     3    20     2
#> 13     3    25     3
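To see what cumsum is counting here, it helps to run the reduction on its own for the id == 1 values; this is just the same logic as above, pulled out by hand:

# Running total of the diffs for id 1, resetting whenever it reaches 5 or more
Reduce(function(x, y) if (x + y >= 5) 0 else x + y,
       diff(c(2, 10, 12, 15, 17)),  # diffs are 8, 2, 3, 2
       init = 0, accumulate = TRUE)
#> [1] 0 0 2 0 2

# `== 0` flags each reset point; cumsum turns the flags into group numbers
cumsum(c(0, 0, 2, 0, 2) == 0)
#> [1] 1 2 2 3 3

which matches the reset column for id 1 above.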

Regarding the edit, the problem is that some of the diffs are 0, which is the same value used to mark a reset. The simplest fix is to use NA instead of zero as the reset value:

library(tidyverse)

sub.df <- structure(list(`ID` = c("1", "1", "1", "1", "1", "1", "1", "1", "1"),
                         dateFormat = structure(c(1479955726, 1479955726, 1483703713,
                                                  1495190809, 1495190809, 1497265079,
                                                  1497265079, 1474023059, 1474023061),
                                                class = c("POSIXct", "POSIXt"),
                                                tzone = "America/Chicago")),
                    .Names = c("ID", "dateFormat"), row.names = c(NA, -9L),
                    class = c("tbl_df", "tbl", "data.frame"))

sub.df %>%
  group_by(ID) %>%
  arrange(ID, dateFormat) %>%
  mutate(reset = cumsum(is.na(
    accumulate(diff(dateFormat), ~{
      s <- sum(.x, .y, na.rm = TRUE)
      if (s >= 5) NA else s
    }, .init = NA)
  )))
#> # A tibble: 9 x 3
#> # Groups:   ID [1]
#>      ID          dateFormat reset
#>   <chr>              <dttm> <int>
#> 1     1 2016-09-16 05:50:59     1
#> 2     1 2016-09-16 05:51:01     1
#> 3     1 2016-11-23 20:48:46     2
#> 4     1 2016-11-23 20:48:46     2
#> 5     1 2017-01-06 05:55:13     3
#> 6     1 2017-05-19 05:46:49     4
#> 7     1 2017-05-19 05:46:49     4
#> 8     1 2017-06-12 05:57:59     5
#> 9     1 2017-06-12 05:57:59     5
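For reference, this is where the over-counting in the original version came from: a 0-second gap leaves the running total at 0, which is indistinguishable from a genuine reset once you test == 0. A minimal base-R sketch on the same sub.df, with the gaps taken in seconds:

# Gaps between consecutive sorted timestamps, in seconds
gaps <- diff(as.numeric(sort(sub.df$dateFormat)))
# 2, 5932665, 0, 3747987, 11487096, 0, 2074270, 0

# The original reduction leaves a 0 after every gap of 5+ seconds *and* after
# every 0-second gap, so `== 0` flags three extra resets
Reduce(function(x, y) if (x + y >= 5) 0 else x + y,
       gaps, init = 0, accumulate = TRUE)
#> [1] 0 2 0 0 0 0 0 0 0

The NA version avoids this because a genuine reset is marked NA, while a 0-second gap just leaves the running total at 0 and is no longer confused with a reset.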

Ultimately this approach has limitations too, though: if any of the values really were NA, the count would be incremented in just the same way. A more robust solution is to return a two-element list from each iteration, one element for the running total (with resets) and one for the reset count. That is more work to implement, though:

sub.df %>%
  group_by(ID) %>%
  arrange(ID, dateFormat) %>%
  mutate(total_reset = accumulate(
    transpose(list(total = diff(dateFormat), reset = rep(0, n() - 1))),
    ~{
      s <- .x$total + .y$total
      if (s >= 5) {
        data_frame(total = 0, reset = .x$reset + 1)
      } else {
        data_frame(total = s, reset = .x$reset)
      }
    },
    .init = data_frame(total = 0, reset = 1)
  )) %>%
  unnest()
#> # A tibble: 9 x 4
#> # Groups:   ID [1]
#>      ID          dateFormat total reset
#>   <chr>              <dttm> <dbl> <dbl>
#> 1     1 2016-09-16 05:50:59     0     1
#> 2     1 2016-09-16 05:51:01     2     1
#> 3     1 2016-11-23 20:48:46     0     2
#> 4     1 2016-11-23 20:48:46     0     2
#> 5     1 2017-01-06 05:55:13     0     3
#> 6     1 2017-05-19 05:46:49     0     4
#> 7     1 2017-05-19 05:46:49     0     4
#> 8     1 2017-06-12 05:57:59     0     5
#> 9     1 2017-06-12 05:57:59     0     5

The totals look a little silly, but if you look at the diffs, the result really is correct.


Maybe I'm wrong (EDIT: I was wrong, alistaire's answer is brilliant, though I'll leave this approach up for now), but I think this is one of those cases where you really do need a loop, because the reset value in each row depends on what happened in the previous rows. I hope Joseph Wood comes up with something smarter than this, but in the meantime here is a naive approach that uses dplyr as requested. We can define the following function:

count_resets <- function(x) {
  N <- length(x)
  value <- 1             # current group number
  result <- rep(1, N)
  threshold <- x[1]      # value at the last reset point
  for (i in 2:N) {
    # start a new group once we have moved 5 or more away from the last reset
    if (abs(x[i] - threshold) >= 5) {
      value <- value + 1
      threshold <- x[i]
    }
    result[i] <- value
  }
  return(result)
}
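As a quick sanity check, running it on the val sequence for id 1 from the original df reproduces the desired grouping:

count_resets(c(2, 10, 12, 15, 17))
#> [1] 1 2 2 3 3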

And apply it by id with dplyr's group_by():

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(reset = count_resets(val))
# A tibble: 13 x 3
# Groups:   id [3]
      id   val reset
   <dbl> <dbl> <dbl>
 1     1     2     1
 2     1    10     2
 3     1    12     2
 4     1    15     3
 5     1    17     3
 6     2     2     1
 7     2     4     1
 8     2     7     2
 9     2     8     2
10     3    12     1
11     3    15     1
12     3    20     2
13     3    25     3
