Calculate the number of occurrences of a specific event in the past and future with groupings

Question


This question is a modification of a problem that I posted here, where I have occurrences of a certain type on different days, but this time they are assigned to several users, for example:

    df = data.frame(user_id = c(rep(1:2, each=5)),
                    cancelled_order = c(rep(c(0,1,1,0,0), 2)),
                    order_date = as.Date(c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-03-23',
                                           '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21', '2015-03-26')))

    user_id cancelled_order order_date
          1               0 2015-01-28
          1               1 2015-01-31
          1               1 2015-02-08
          1               0 2015-02-23
          1               0 2015-03-23
          2               0 2015-01-25
          2               1 2015-01-28
          2               1 2015-02-06
          2               0 2015-02-21
          2               0 2015-03-26

I would like to calculate:

1) the number of cancelled orders that each customer will have in the next x days (for example, 7, 14), excluding the current one, and

2) the number of cancelled orders that each customer had in the past x days (for example, 7, 14), excluding the current one.

The desired result will look like this:

    solution
    user_id cancelled_order order_date plus14 minus14
          1               0 2015-01-28      2       0
          1               1 2015-01-31      1       0
          1               1 2015-02-08      0       1
          1               0 2015-02-23      0       0
          1               0 2015-03-23      0       0
          2               0 2015-01-25      2       0
          2               1 2015-01-28      1       0
          2               1 2015-02-06      0       1
          2               0 2015-02-21      0       0
          2               0 2015-03-26      0       0

A solution that is perfect for this purpose was provided by @joel.wilson using data.table:

    library(data.table)
    vec <- c(14, 30)  # Specify desired ranges
    setDT(df)[, paste0("x", vec) :=
                lapply(vec, function(i)
                  sum(df$cancelled_order[between(df$order_date,
                                                 order_date,
                                                 order_date + i,  # this part can be changed to reflect the past date ranges
                                                 incbounds = FALSE)])),
              by = order_date]

However, it does not account for the grouping on user_id. When I tried to change the formula by adding this grouping as by = c("user_id", "order_date") or by = list(user_id, order_date), it did not work. It seems I am missing something very basic; any hints on how to get around this detail?

Also, keep in mind that I'm after any working solution, even if it is not based on the above code or data.table at all!

Thanks!

+4

r group-by data.table dplyr

Kasia Kulma Jan 12 '17 at 2:43


2 answers

Maybe I made this solution a little complicated:

    library(dplyr)
    library(tidyr)

    vec <- c(7,14)

    reslist <- lapply(vec, function(x){
      df %>%
        merge(df %>% rename(cancelled_order2 = cancelled_order, order_date2 = order_date)) %>%
        filter(abs(order_date-order_date2)<=x) %>%
        group_by(user_id, order_date) %>%
        arrange(order_date2) %>%
        mutate(cumcancel = cumsum(cancelled_order2)) %>%
        mutate(before = cumcancel - cancelled_order2,
               after = max(cumcancel) - cumcancel) %>%
        filter(order_date == order_date2) %>%
        select(user_id, cancelled_order, order_date, before, after) %>%
        mutate(within = x)})

    do.call(rbind, reslist) %>%
      gather(key, value, -user_id, -cancelled_order, -order_date, -within) %>%
      mutate(col = paste0(key,"_",within)) %>%
      select(-within, - key) %>%
      spread(col, value) %>%
      arrange(user_id, order_date)

PS: I noticed an error in your output example (user_id 1, order_date 2015-02-23, minus14 should be 0, since there are 15 days between 02/08 and 02/23)
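The day count in the PS can be checked directly with base R date arithmetic. This is only a small sketch; the variable name `gap` is introduced here for illustration and does not appear in the question or answers.

```r
# Gap in days between the two order dates mentioned above (user_id 1)
gap <- as.numeric(as.Date("2015-02-23") - as.Date("2015-02-08"))
gap        # 15
gap <= 14  # FALSE: the 2015-02-08 cancellation falls outside a 14-day window
```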

+1

Wietze314 Jan 12 '17 at 15:57


Frank · Accepted Answer · 2017-01-12T16:24:10+0000

Here is one way:

    library(data.table)
    orderDT = with(df, data.table(id = user_id, completed = !cancelled_order, d = order_date))

    vec = list(minus = 14L, plus = 14L)
    orderDT[, c("dplus", "dminus") := .(
      orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
                            on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N
      ,
      orderDT[!(completed)][orderDT[, .(id, d_minus = d - vec$minus, d_yest = d - 1L)],
                            on=.(id, d >= d_minus, d <= d_yest), .N, by=.EACHI]$N
    )]

        id completed          d dplus dminus
     1:  1      TRUE 2015-01-28     2      0
     2:  1     FALSE 2015-01-31     1      0
     3:  1     FALSE 2015-02-08     0      1
     4:  1      TRUE 2015-02-23     0      0
     5:  1      TRUE 2015-03-23     0      0
     6:  2      TRUE 2015-01-25     2      0
     7:  2     FALSE 2015-01-28     1      0
     8:  2     FALSE 2015-02-06     0      1
     9:  2      TRUE 2015-02-21     0      0
    10:  2      TRUE 2015-03-26     0      0

(I found the OP's column names bulky, so I shortened them.)

How it works

Each of the columns can be run on its own, for example:

    orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
                          on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N

And this can be broken down into steps by simplifying it one piece at a time:

    orderDT[!(completed)][
      orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
      on=.(id, d <= d_plus, d >= d_tom),
      .N,
      by=.EACHI]$N
    # original version

    orderDT[!(completed)][
      orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
      on=.(id, d <= d_plus, d >= d_tom),
      .N,
      by=.EACHI]
    # don't extract the N column of counts

    orderDT[!(completed)][
      orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
      on=.(id, d <= d_plus, d >= d_tom)]
    # don't create the N column of counts

    orderDT[!(completed)]
    # don't do the join

    orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)]
    # see the second table used in the join

This uses a "non-equi" join, in which inequalities in the on= clause define the date ranges. For more information, see the documentation page found by typing ?data.table.
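To see the join pattern in isolation, here is a minimal sketch on made-up data. The tables `cancels` and `windows`, their columns `lo` and `hi`, and the dates are all hypothetical and chosen only for illustration; the `on=`, `.N`, and `by=.EACHI` usage mirrors the answer above. It assumes data.table 1.9.8 or later, where non-equi joins are available.

```r
library(data.table)

# Hypothetical toy data: one user with three cancellation dates
cancels <- data.table(id = 1L,
                      d  = as.Date(c("2015-01-31", "2015-02-08", "2015-03-01")))

# Hypothetical lookup table: one date window [lo, hi] per row
windows <- data.table(id = 1L,
                      lo = as.Date(c("2015-01-29", "2015-02-09")),
                      hi = as.Date(c("2015-02-11", "2015-02-23")))

# For each row of `windows`, count the rows of `cancels` for the same id
# whose date d satisfies lo <= d <= hi; by=.EACHI evaluates .N once per
# row of `windows`, so unmatched windows yield a count of 0
res <- cancels[windows, on = .(id, d >= lo, d <= hi), .N, by = .EACHI]
res$N  # 2 0
```

The first window covers 2015-01-31 and 2015-02-08, so its count is 2; the second window contains no cancellation, so its count is 0. Because there is exactly one count per row of the lookup table, the resulting vector can be assigned straight back as a new column, which is what the answer does with `:=`.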
