R: calculate the number of distinct categories within a specified time period

Question

Here are some dummy data:

    user_id  date        category
    27       2016-01-01  apple
    27       2016-01-03  apple
    27       2016-01-05  pear
    27       2016-01-07  plum
    27       2016-01-10  apple
    27       2016-01-14  pear
    27       2016-01-16  plum
    11       2016-01-01  apple
    11       2016-01-03  pear
    11       2016-01-05  pear
    11       2016-01-07  pear
    11       2016-01-10  apple
    11       2016-01-14  apple
    11       2016-01-16  apple

I would like to calculate, for each user_id, the number of distinct categories within a specified period of time (for example, the last 7 or 14 days), including the current order.

The expected result looks like this:

    user_id  date        category  distinct_7  distinct_14
    27       2016-01-01  apple     1           1
    27       2016-01-03  apple     1           1
    27       2016-01-05  pear      2           2
    27       2016-01-07  plum      3           3
    27       2016-01-10  apple     3           3
    27       2016-01-14  pear      3           3
    27       2016-01-16  plum      3           3
    11       2016-01-01  apple     1           1
    11       2016-01-03  pear      2           2
    11       2016-01-05  pear      2           2
    11       2016-01-07  pear      2           2
    11       2016-01-10  apple     2           2
    11       2016-01-14  apple     2           2
    11       2016-01-16  apple     1           2

I posted similar questions here and here; however, neither of them covered calculating cumulative unique values over a specified period of time. Many thanks for your help!

r data.table dplyr distinct-values

Kasia Kulma Jan 17 '17 at 9:12

2 answers

Here are two data.table solutions: one with two nested lapply calls, the other with non-equi joins.

The first is a rather clumsy data.table solution, but it reproduces the expected answer, and it works for an arbitrary number of time frames. (That said, @alistaire's concise tidyverse solution, which he proposed in his comment, could be modified to do the same.)

It uses two nested lapply calls. The outer loop iterates over the time frames, the inner one over the dates. The temporary result is joined with the original data and then reshaped from long to wide format, so that we end up with a separate column for each time frame.

    library(data.table)
    tmp <- rbindlist(
      lapply(c(7L, 14L), function(ldays) rbindlist(
        lapply(unique(dt$date), function(ldate) {
          dt[between(date, ldate - ldays, ldate),
             .(distinct = sprintf("distinct_%02i", ldays),
               date = ldate,
               N = uniqueN(category)),
             by = .(user_id)]
        })
      ))
    )
    dcast(tmp[dt, on=c("user_id", "date")],
          ... ~ distinct, value.var = "N")[order(-user_id, date, category)]
    #          date user_id category distinct_07 distinct_14
    # 1: 2016-01-01      27    apple           1           1
    # 2: 2016-01-03      27    apple           1           1
    # 3: 2016-01-05      27     pear           2           2
    # 4: 2016-01-07      27     plum           3           3
    # 5: 2016-01-10      27    apple           3           3
    # 6: 2016-01-14      27     pear           3           3
    # 7: 2016-01-16      27     plum           3           3
    # 8: 2016-01-01      11    apple           1           1
    # 9: 2016-01-03      11     pear           2           2
    #10: 2016-01-05      11     pear           2           2
    #11: 2016-01-07      11     pear           2           2
    #12: 2016-01-10      11    apple           2           2
    #13: 2016-01-14      11    apple           2           2
    #14: 2016-01-16      11    apple           1           2

Here is a variant suggested by @Frank that uses a data.table non-equi join instead of the second lapply:

    tmp <- rbindlist(
      lapply(c(7L, 14L), function(ldays) {
        dt[.(user_id = user_id, dago = date - ldays, d = date),
           on=.(user_id, date >= dago, date <= d),
           .(distinct = sprintf("distinct_%02i", ldays),
             N = uniqueN(category)),
           by = .EACHI]
      })
    )[, date := NULL]
    #
    dcast(tmp[dt, on=c("user_id", "date")],
          ... ~ distinct, value.var = "N")[order(-user_id, date, category)]

Data:

    dt <- fread("user_id  date category
    27 2016-01-01 apple
    27 2016-01-03 apple
    27 2016-01-05 pear
    27 2016-01-07 plum
    27 2016-01-10 apple
    27 2016-01-14 pear
    27 2016-01-16 plum
    11 2016-01-01 apple
    11 2016-01-03 pear
    11 2016-01-05 pear
    11 2016-01-07 pear
    11 2016-01-10 apple
    11 2016-01-14 apple
    11 2016-01-16 apple")
    dt[, date := as.IDate(date)]

BTW: the wording "for the last 7, 14 days" is somewhat misleading, as the time windows actually span 8 and 15 days, respectively, because both end points are included.
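To make the day count concrete, here is a minimal base R sketch. The helper name days_in_window is made up for illustration; it simply counts the calendar days in a window that includes both end points, which is how between() treats its bounds by default.

```r
# Hypothetical helper: number of calendar days in the inclusive window [d - n, d]
days_in_window <- function(d, n) length(seq(d - n, d, by = "day"))

d <- as.Date("2016-01-16")

n7  <- days_in_window(d, 7)    # window 2016-01-09 .. 2016-01-16
n14 <- days_in_window(d, 14)   # window 2016-01-02 .. 2016-01-16
print(c(n7, n14))              # 8 15

# A window of exactly 7 calendar days would have to start at d - 6 instead
n_exact <- length(seq(d - 6, d, by = "day"))
print(n_exact)                 # 7
```

If a strict 7-day or 14-day window is wanted, subtracting one day less (ldays - 1) from the lower bound would be one way to get it.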

Uwe Jan 18 '17 at 9:17

alistaire · Accepted Answer · 2017-01-20T00:53:14+0000

In the tidyverse, you can use map_int to iterate over a set of values and simplify the result to an integer vector, à la sapply or vapply. Count the distinct occurrences with n_distinct (equivalent to length(unique(...))) on a subset of category selected by comparisons or the between helper, with the lower bound set to the current date minus the window length, and you're set.

    library(tidyverse)

    df %>%
        group_by(user_id) %>%
        mutate(distinct_7  = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])),
               distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))

    ## Source: local data frame [14 x 5]
    ## Groups: user_id [2]
    ##
    ##    user_id       date category distinct_7 distinct_14
    ##      <int>     <date>   <fctr>      <int>       <int>
    ## 1       27 2016-01-01    apple          1           1
    ## 2       27 2016-01-03    apple          1           1
    ## 3       27 2016-01-05     pear          2           2
    ## 4       27 2016-01-07     plum          3           3
    ## 5       27 2016-01-10    apple          3           3
    ## 6       27 2016-01-14     pear          3           3
    ## 7       27 2016-01-16     plum          3           3
    ## 8       11 2016-01-01    apple          1           1
    ## 9       11 2016-01-03     pear          2           2
    ## 10      11 2016-01-05     pear          2           2
    ## 11      11 2016-01-07     pear          2           2
    ## 12      11 2016-01-10    apple          2           2
    ## 13      11 2016-01-14    apple          2           2
    ## 14      11 2016-01-16    apple          1           2
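As a package-free cross-check of the same window logic, here is a minimal base R sketch. The helper name rolling_distinct and the hand-built df are assumptions for illustration only, not part of the answer above. It loops over rows and scans the whole data frame each time, so it is meant for small data, but it takes any number of window lengths.

```r
# Dummy data from the question, rebuilt by hand (assumed layout)
dates <- as.Date(c("2016-01-01", "2016-01-03", "2016-01-05", "2016-01-07",
                   "2016-01-10", "2016-01-14", "2016-01-16"))
df <- data.frame(
  user_id  = rep(c(27L, 11L), each = 7),
  date     = rep(dates, times = 2),
  category = c("apple", "apple", "pear", "plum", "apple", "pear", "plum",
               "apple", "pear", "pear", "pear", "apple", "apple", "apple"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row, count the distinct categories of the
# same user whose date lies in the inclusive window [date - n_days, date]
rolling_distinct <- function(data, n_days) {
  vapply(seq_len(nrow(data)), function(i) {
    in_window <- data$user_id == data$user_id[i] &
      data$date >= data$date[i] - n_days &
      data$date <= data$date[i]
    length(unique(data$category[in_window]))
  }, integer(1))
}

# One column per window length
for (n in c(7L, 14L)) {
  df[[sprintf("distinct_%d", n)]] <- rolling_distinct(df, n)
}
print(df)
```

On the dummy data this should give the same distinct_7 and distinct_14 columns as the expected result in the question.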
