dplyr: group and summarize/mutate data using time windows

I have irregular time data representing a certain type of transaction for users. Each row of data has a time stamp and represents a transaction at that time. Due to data irregularities, some users may have 100 rows per day, while other users may have 0 or 1 transaction per day.

The data may look something like this:

    data.frame(
      id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
      date = c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
               "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01", "2015-01-01"),
      n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
    )

      id       date n_widgets
    1  1 2015-01-01         1
    2  1 2015-01-01         2
    3  1 2015-01-05         3
    4  1 2015-01-25         4
    5  1 2015-02-15         4
    6  2 2015-05-05         5
    7  2 2015-01-01         2
    8  3 2015-08-01         4
    9  4 2015-01-01         5

I often want quick summary statistics for users. For example: for a given user on a given day, how many transactions occurred in the previous 30 days, how many widgets were sold in the previous 30 days, etc.

Given the example above, the output should look like this:

      id       date n_widgets n_trans_30 total_widgets_30
    1  1 2015-01-01         1          1                1
    2  1 2015-01-01         2          2                3
    3  1 2015-01-05         3          3                6
    4  1 2015-01-25         4          4               10
    5  1 2015-02-15         4          2                8
    6  2 2015-05-05         5          1                5
    7  2 2015-01-01         2          1                2
    8  3 2015-08-01         4          1                4
    9  4 2015-01-01         5          1                5

If the time window is daily, then the solution is simple:

    data %>% group_by(id, date) %>% summarize(...)

Similarly, if the time window is monthly, it is also relatively simple with lubridate:

    data %>% group_by(id, year(date), month(date)) %>% summarize(...)

However, the problem I am facing is how to set the time window to an arbitrary period: 5 days, 10 days, etc.
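For fixed, non-overlapping N-day buckets (not the rolling window described above, but sometimes enough), one option is to derive a window index from the date itself and group on that. A sketch, where `window_5d` is a column name I made up:

```r
library(dplyr)

data <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
           "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01", "2015-01-01"),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)

bucketed <- data %>%
  mutate(date = as.Date(date),
         # integer index of the 5-day bucket each row falls into,
         # counted from the earliest date in the data
         window_5d = as.integer(date - min(date)) %/% 5) %>%
  group_by(id, window_5d) %>%
  summarise(n_trans = n(), total_widgets = sum(n_widgets))
```

Changing the divisor changes the period (10 for 10-day buckets, and so on). This still does not give a *trailing* window per row, which is what the question actually asks for.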

There is also the RcppRoll library, but both RcppRoll and the rolling functions in zoo seem geared toward regular time series. As far as I can tell, these window functions operate on a fixed number of rows rather than a given time period; the key difference is that a given period of time can contain a different number of rows depending on the date and the user.

For example, for user 1, the 5 days preceding 2015-01-01 might contain 100 transactions, while for the same user the 5 days preceding 2015-02-01 might contain only 5 transactions. So looking back a fixed number of rows simply won't work.

In addition, there is another SO thread discussing rolling date windows for irregular time series data ( Create a new column based on a condition that exists during the moving date ), however the solution given there uses data.table , and I am specifically looking for a dplyr way to achieve this.

I believe the heart of this problem is the question: how do I group_by arbitrary time periods in dplyr? Alternatively, if there is another dplyr way to achieve the above without a complex group_by , how can I do it?

EDIT: Updated the example to make the nature of the rolling window clearer.


This can be done using SQL:

    library(sqldf)

    dd <- transform(data, date = as.Date(date))

    sqldf("select a.*, count(*) n_trans30, sum(b.n_widgets) 'total_widgets30'
           from dd a
           left join dd b on b.date between a.date - 30 and a.date
                         and b.id = a.id
                         and b.rowid <= a.rowid
           group by a.rowid")

giving:

      id       date n_widgets n_trans30 total_widgets30
    1  1 2015-01-01         1         1               1
    2  1 2015-01-01         2         2               3
    3  1 2015-01-05         3         3               6
    4  1 2015-01-25         4         4              10
    5  2 2015-05-05         5         1               5
    6  2 2015-01-01         2         1               2
    7  3 2015-08-01         4         1               4
    8  4 2015-01-01         5         1               5
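For reference, the same self-join idea can be written in pure dplyr (a sketch, assuming a dplyr version with the `suffix` argument to the join functions, i.e. >= 0.5; the `row` helper column mimics SQLite's rowid):

```r
library(dplyr)

data <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
           "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01", "2015-01-01"),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)

# add a rowid-like column, then self-join on id and keep only
# earlier-or-same rows whose date falls in the trailing 30-day window
dd <- data %>% mutate(date = as.Date(date), row = row_number())

result <- dd %>%
  inner_join(dd, by = "id", suffix = c("", ".b")) %>%
  filter(date.b >= date - 30, date.b <= date, row.b <= row) %>%
  group_by(row, id, date, n_widgets) %>%
  summarise(n_trans_30 = n(), total_widgets_30 = sum(n_widgets.b)) %>%
  ungroup() %>%
  arrange(row) %>%
  select(-row)
```

Like the SQL version, this materializes every pair of rows per user before filtering, so it can get expensive for users with many transactions.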

Another approach is to expand your dataset so that it contains all possible days (using tidyr::complete ), then use a rolling function ( RcppRoll::roll_sum ).

The fact that you can have several observations per day probably creates a problem, though...

    library(dplyr)
    library(tidyr)
    library(RcppRoll)

    df2 <- df %>% mutate(date = as.Date(date))

    ## create full dataset with all possible dates
    ## (go back 30 days before the first observation)
    df_full <- df2 %>%
      complete(id,
               date = seq(from = min(.$date) - 30, to = max(.$date), by = 1),
               fill = list(n_widgets = 0))

    ## now use the rolling function, and keep only the original rows (right join)
    df_roll <- df_full %>%
      group_by(id) %>%
      mutate(n_trans_30 = roll_sum(x = n_widgets != 0, n = 30, fill = 0, align = "right"),
             total_widgets_30 = roll_sum(x = n_widgets, n = 30, fill = 0, align = "right")) %>%
      ungroup() %>%
      right_join(df2, by = c("date", "id", "n_widgets"))

The result matches yours:

         id       date n_widgets n_trans_30 total_widgets_30
      <dbl>     <date>     <dbl>      <dbl>            <dbl>
    1     1 2015-01-01         1          1                1
    2     1 2015-01-01         2          2                3
    3     1 2015-01-05         3          3                6
    4     1 2015-01-25         4          4               10
    5     1 2015-02-15         4          2                8
    6     2 2015-05-05         5          1                5
    7     2 2015-01-01         2          1                2
    8     3 2015-08-01         4          1                4
    9     4 2015-01-01         5          1                5

But as noted, this will fail when there are several observations in a day, since roll_sum counts the last 30 observations, not the last 30 days. So you could first summarise the information by day, and then apply the rolling function.
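A sketch of that per-day pre-aggregation (the `n_trans` column name is my own choice); after this step every id/date pair is unique, so the `complete()` + `roll_sum()` pipeline above would count days rather than rows:

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
           "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01", "2015-01-01"),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)

## collapse to one row per id and day before rolling
daily <- df %>%
  mutate(date = as.Date(date)) %>%
  group_by(id, date) %>%
  summarise(n_trans = n(), n_widgets = sum(n_widgets)) %>%
  ungroup()
```

Note that rolling over the daily summary gives per-day results; to attach them back to individual transactions you would still need a join, and the per-transaction "only rows up to this one" semantics of the question's expected output is lost.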


EDITED based on the comment below.

You can try something like this for 5 days:

    df %>%
      arrange(id, date) %>%
      group_by(id) %>%
      filter(as.numeric(difftime(Sys.Date(), date, units = "days")) <= 5) %>%
      summarise(n_total_widgets = sum(n_widgets))

In this case none of the dates fall within the last five days of today, so it will not output any results.

To get the last five days for each ID, you can do something like this:

    df %>%
      arrange(id, date) %>%
      group_by(id) %>%
      filter(as.numeric(difftime(max(date), date, units = "days")) <= 5) %>%
      summarise(n_total_widgets = sum(n_widgets))

The output will be:

    Source: local data frame [4 x 2]

         id n_total_widgets
      (dbl)           (dbl)
    1     1               4
    2     2               5
    3     3               4
    4     4               5
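For completeness, the exact per-row 30-day statistics from the question can also be computed with a row-wise helper inside a grouped `mutate` (a sketch; it is O(n²) per user, but needs nothing beyond dplyr and base R, and like the SQL answer it only counts rows at or before the current row):

```r
library(dplyr)

data <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
           "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01", "2015-01-01"),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)

result <- data %>%
  mutate(date = as.Date(date)) %>%
  group_by(id) %>%
  mutate(
    # for each row i, count earlier-or-same rows of this user whose
    # date falls in the trailing 30-day window ending at date[i]
    n_trans_30 = sapply(seq_along(date), function(i)
      sum(date >= date[i] - 30 & date <= date[i] & seq_along(date) <= i)),
    # same window, but summing widgets instead of counting rows
    total_widgets_30 = sapply(seq_along(date), function(i)
      sum(n_widgets[date >= date[i] - 30 & date <= date[i] & seq_along(date) <= i]))
  ) %>%
  ungroup()
```

On the example data this reproduces the expected n_trans_30 and total_widgets_30 columns from the question.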

Source: https://habr.com/ru/post/1245659/

