I have irregularly spaced time data representing a certain type of transaction for users. Each row has a timestamp and represents one transaction at that time. Because activity is so uneven, some users may have 100 rows per day, while others may have 0 or 1 transactions per day.
The data may look something like this:
    data.frame(
      id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
      date = c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25", "2015-02-15",
               "2015-05-05", "2015-01-01", "2015-08-01", "2015-01-01"),
      n_widgets = c(1,2,3,4,4,5,2,4,5)
    )

      id       date n_widgets
    1  1 2015-01-01         1
    2  1 2015-01-01         2
    3  1 2015-01-05         3
    4  1 2015-01-25         4
    5  1 2015-02-15         4
    6  2 2015-05-05         5
    7  2 2015-01-01         2
    8  3 2015-08-01         4
    9  4 2015-01-01         5
I often want quick rolling statistics about users. For example: for a given user on a given day, how many transactions occurred in the previous 30 days, how many widgets were sold in the previous 30 days, and so on.
For the example above, the desired output would look like this:
      id       date n_widgets n_trans_30 total_widgets_30
    1  1 2015-01-01         1          1                1
    2  1 2015-01-01         2          2                3
    3  1 2015-01-05         3          3                6
    4  1 2015-01-25         4          4               10
    5  1 2015-02-15         4          2                8
    6  2 2015-05-05         5          1                5
    7  2 2015-01-01         2          1                2
    8  3 2015-08-01         4          1                4
    9  4 2015-01-01         5          1                5
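The intended semantics can be pinned down with a brute-force base R sketch. This is only a reference loop, not the dplyr approach being asked for. It assumes the window for a row covers the same user's rows that appear at or before it in the data, dated no later than that row and no more than 30 days earlier. These rules are inferred from the expected output above rather than stated explicitly, and `window_days` and `in_window` are illustrative names.

```r
# Example data from the question, with dates parsed as Date objects
data <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
                   "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01",
                   "2015-01-01")),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)

window_days <- 30  # assumed look-back length in days

n_trans_30 <- integer(nrow(data))
total_widgets_30 <- numeric(nrow(data))

for (i in seq_len(nrow(data))) {
  # Assumed window: same user, at or before row i in row order,
  # dated within [date_i - window_days, date_i]
  in_window <- seq_len(nrow(data)) <= i &
    data$id == data$id[i] &
    data$date <= data$date[i] &
    data$date >= data$date[i] - window_days
  n_trans_30[i] <- sum(in_window)
  total_widgets_30[i] <- sum(data$n_widgets[in_window])
}

data$n_trans_30 <- n_trans_30
data$total_widgets_30 <- total_widgets_30
print(data)
```

The loop is quadratic in the number of rows, so it is useful only as a check on small samples against whatever grouped solution is eventually used.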
If the time window is daily, then the solution is simple:

    data %>% group_by(id, date) %>% summarize(...)

Similarly, if the time window is monthly, it is also relatively simple with lubridate:

    data %>% group_by(id, year(date), month(date)) %>% summarize(...)
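For reference, the two fixed-window cases can be made runnable roughly as follows. This is a sketch: the summaries `n_trans` and `total_widgets` and the grouping column names `yr` and `mon` are illustrative choices standing in for the elided `summarize(...)` arguments, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(lubridate)

data <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
                   "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01",
                   "2015-01-01")),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)

# Daily window: one output row per user per calendar day
daily <- data %>%
  group_by(id, date) %>%
  summarize(n_trans = n(), total_widgets = sum(n_widgets), .groups = "drop")

# Monthly window: one output row per user per calendar month
monthly <- data %>%
  group_by(id, yr = year(date), mon = month(date)) %>%
  summarize(n_trans = n(), total_widgets = sum(n_widgets), .groups = "drop")

print(daily)
print(monthly)
```

Both versions produce one row per calendar bucket, with buckets that do not overlap. A rolling 30-day look-back instead needs one result per transaction, with overlapping windows.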
However, the problem I am facing is how to set the time window to an arbitrary period: 5 days, 10 days, etc.
There is also the RcppRoll library, but both RcppRoll and the rolling functions in zoo seem geared toward regular time series. As far as I can tell, these window functions operate on a number of rows rather than a span of time. The key difference is that a given span of time can contain a different number of rows depending on the date and the user.
For example, user 1 might have 100 transactions in the 5 days before 2015-01-01 but only 5 transactions in the 5 days before 2015-02-01. So looking back a fixed number of rows simply won't work.
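The mismatch can be seen on user 1's dates from the example data. The sketch below measures how many calendar days a fixed window of `k` rows actually covers. The window size `k = 3` and the name `span_days` are arbitrary choices for illustration.

```r
# User 1's transaction dates from the example data
dates <- as.Date(c("2015-01-01", "2015-01-01", "2015-01-05",
                   "2015-01-25", "2015-02-15"))

k <- 3  # illustrative row-based window size

# Days between the first and last row of each k-row window
span_days <- rep(NA_real_, length(dates))
for (i in seq_along(dates)) {
  if (i >= k) {
    span_days[i] <- as.numeric(difftime(dates[i], dates[i - k + 1], units = "days"))
  }
}

print(span_days)
# The same 3-row window covers 4, 24, and 41 days for rows 3, 4, and 5
```

A row-based rolling function would treat those three windows as equivalent, even though only the first fits inside a 5-day period.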
In addition, there is another SO thread discussing moving date windows for irregular time series data ( Create a new column based on a condition that exists during the moving date ). However, the solution given there used data.table, and I am specifically looking for a dplyr way to achieve this.
I believe the heart of this problem comes down to one question: how do I group_by arbitrary time periods in dplyr? Alternatively, if there is another dplyr way to achieve the above without a complex group_by, how can I do this?
EDIT: Updated the example to make the nature of the rolling window clearer.