Calculate the number of occurrences of a specific event in the past and future with groupings

This question is a modification of a problem that I posted here, where I have occurrences of a certain type on different days, but this time they are assigned to several users, for example:

    df = data.frame(user_id = c(rep(1:2, each=5)),
                    cancelled_order = c(rep(c(0,1,1,0,0), 2)),
                    order_date = as.Date(c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-03-23',
                                           '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21', '2015-03-26')))

    user_id cancelled_order order_date
          1               0 2015-01-28
          1               1 2015-01-31
          1               1 2015-02-08
          1               0 2015-02-23
          1               0 2015-03-23
          2               0 2015-01-25
          2               1 2015-01-28
          2               1 2015-02-06
          2               0 2015-02-21
          2               0 2015-03-26

I would like to calculate

1) the number of cancelled orders that each customer will have during the next x days (for example, 7 or 14), excluding the current one, and

2) the number of cancelled orders that each customer had in the past x days (for example, 7 or 14), excluding the current one.

The desired result will look like this:

    solution
    user_id cancelled_order order_date plus14 minus14
          1               0 2015-01-28      2       0
          1               1 2015-01-31      1       0
          1               1 2015-02-08      0       1
          1               0 2015-02-23      0       0
          1               0 2015-03-23      0       0
          2               0 2015-01-25      2       0
          2               1 2015-01-28      1       0
          2               1 2015-02-06      0       1
          2               0 2015-02-21      0       0
          2               0 2015-03-26      0       0

A solution that works perfectly for the earlier problem was provided by @joel.wilson using data.table:

    library(data.table)
    vec <- c(14, 30)  # Specify desired ranges
    setDT(df)[, paste0("x", vec) :=
                lapply(vec, function(i)
                  sum(df$cancelled_order[between(df$order_date,
                                                 order_date,
                                                 order_date + i,  # this part can be changed to reflect the past date ranges
                                                 incbounds = FALSE)])),
              by = order_date]

However, it does not account for the grouping on user_id. When I tried to change the formula by adding this group as by = c("user_id", "order_date") or by = list(user_id, order_date), it did not work. This seems like something very basic; any hints on how to get around it?
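A likely reason the change to by has no effect: inside the expression, df$cancelled_order and df$order_date always refer to the whole table, so grouping never restricts them. The following is only a minimal sketch of one way around that, assuming the df defined above and a 14-day window. It groups by user_id alone and uses the bare column names, which data.table evaluates within each group. The names dt and x are illustrative and not part of the original code.

```r
library(data.table)

df <- data.frame(
  user_id = rep(1:2, each = 5),
  cancelled_order = rep(c(0, 1, 1, 0, 0), 2),
  order_date = as.Date(c("2015-01-28", "2015-01-31", "2015-02-08", "2015-02-23", "2015-03-23",
                         "2015-01-25", "2015-01-28", "2015-02-06", "2015-02-21", "2015-03-26"))
)

dt <- as.data.table(df)  # illustrative copy, leaves df untouched
x <- 14                  # window length in days (assumed for illustration)

# Within each user, order_date and cancelled_order hold only that user's rows.
# For every order k, sum the cancellations dated after it and at most x days later.
dt[, plus14 := vapply(seq_len(.N), function(k) {
  ahead <- order_date > order_date[k] & order_date <= order_date[k] + x
  sum(cancelled_order[ahead])
}, numeric(1)), by = user_id]

# Same idea looking backwards: cancellations dated before order k, at most x days earlier.
dt[, minus14 := vapply(seq_len(.N), function(k) {
  behind <- order_date < order_date[k] & order_date >= order_date[k] - x
  sum(cancelled_order[behind])
}, numeric(1)), by = user_id]

print(dt)
```

This compares every pair of orders within a user, so it is quadratic in the number of orders per user; that is fine for small groups but may be slow for users with many orders.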

Also, keep in mind that I'm after any working solution, even if it is not based on the above code or on data.table at all!

Thanks!

2 answers

Here is one way:

    library(data.table)

    orderDT = with(df, data.table(id = user_id, completed = !cancelled_order, d = order_date))

    vec = list(minus = 14L, plus = 14L)
    orderDT[, c("dplus", "dminus") := .(
      orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N
      ,
      orderDT[!(completed)][orderDT[, .(id, d_minus = d - vec$minus, d_yest = d - 1L)], on=.(id, d >= d_minus, d <= d_yest), .N, by=.EACHI]$N
    )]

        id completed          d dplus dminus
     1:  1      TRUE 2015-01-28     2      0
     2:  1     FALSE 2015-01-31     1      0
     3:  1     FALSE 2015-02-08     0      1
     4:  1      TRUE 2015-02-23     0      0
     5:  1      TRUE 2015-03-23     0      0
     6:  2      TRUE 2015-01-25     2      0
     7:  2     FALSE 2015-01-28     1      0
     8:  2     FALSE 2015-02-06     0      1
     9:  2      TRUE 2015-02-21     0      0
    10:  2      TRUE 2015-03-26     0      0

(I found the OP's column names bulky, so I shortened them.)


How it works

Each of the columns can be run on its own, for example:

    orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N

And this can be broken down into steps by simplifying it piece by piece:

    orderDT[!(completed)][
      orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
      on=.(id, d <= d_plus, d >= d_tom),
      .N,
      by=.EACHI]$N
    # original version

    orderDT[!(completed)][
      orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
      on=.(id, d <= d_plus, d >= d_tom),
      .N,
      by=.EACHI]
    # don't extract the N column of counts

    orderDT[!(completed)][
      orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
      on=.(id, d <= d_plus, d >= d_tom)]
    # don't create the N column of counts

    orderDT[!(completed)]
    # don't do the join

    orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)]
    # see the second table used in the join

This uses a "non-equi" join, in which inequalities define the date ranges. For more information, see the documentation page found by typing ?data.table.
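To see the non-equi join in isolation, here is a minimal sketch on toy data. The tables events and windows and the columns lo and hi are made up for illustration and do not appear in the answer above. Each row of windows is matched against the rows of events with the same id whose date d falls inside [lo, hi], and by = .EACHI counts the matches for each row of windows.

```r
library(data.table)

# Toy data (hypothetical): three dated events and one date window per id
events <- data.table(
  id = c(1L, 1L, 2L),
  d  = as.Date(c("2015-01-31", "2015-02-08", "2015-01-28"))
)
windows <- data.table(
  id = c(1L, 2L),
  lo = as.Date(c("2015-01-29", "2015-02-01")),
  hi = as.Date(c("2015-02-11", "2015-02-15"))
)

# For each row of windows, count the events with the same id and lo <= d <= hi.
# Rows of windows without any match get a count of 0.
counts <- events[windows, on = .(id, d >= lo, d <= hi), .N, by = .EACHI]
counts$N
```

Here the first window contains both events of id 1, while the window for id 2 starts after that id's only event, so the counts are 2 and 0.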


Maybe I made this solution a little complicated:

    library(dplyr)
    library(tidyr)

    vec <- c(7,14)
    reslist <- lapply(vec, function(x){
      df %>%
        merge(df %>% rename(cancelled_order2 = cancelled_order, order_date2 = order_date)) %>%
        filter(abs(order_date-order_date2)<=x) %>%
        group_by(user_id, order_date) %>%
        arrange(order_date2) %>%
        mutate(cumcancel = cumsum(cancelled_order2)) %>%
        mutate(before = cumcancel - cancelled_order2,
               after = max(cumcancel) - cumcancel) %>%
        filter(order_date == order_date2) %>%
        select(user_id, cancelled_order, order_date, before, after) %>%
        mutate(within = x)})

    do.call(rbind, reslist) %>%
      gather(key, value, -user_id, -cancelled_order, -order_date, -within) %>%
      mutate(col = paste0(key,"_",within)) %>%
      select(-within, - key) %>%
      spread(col, value) %>%
      arrange(user_id, order_date)

PS: I noticed an error in your output example (user_id 1, order_date 2015-02-23, minus14 should be 0, since there are 15 days between 02/08 and 02/23)
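The 15-day gap mentioned in the PS can be checked with plain Date arithmetic in base R; the variable name gap is only illustrative:

```r
# Days between the 02/08 cancellation and the 02/23 order
gap <- as.numeric(as.Date("2015-02-23") - as.Date("2015-02-08"))
gap        # 15
gap <= 14  # FALSE, so the 02/08 cancellation falls outside a 14-day look-back
```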


Source: https://habr.com/ru/post/1263001/

