Use rollsum and combine two data frames

I have two data sets, purchases and contacts. The only columns they have in common are a user ID and a week number.

The contact table contains the week number in which the user was contacted, plus a contacted flag that is 1 (contact) or 0 (no contact).

The purchase table contains the week number in which the user made a purchase.

For each purchase week, I want to work out whether the user was contacted in the previous n weeks (n may be 4, 8 or 12), counting the current week (i.e. "4 previous weeks" means the current week plus the 3 weeks before it). The week numbers are fixed and run from 1 to 147.

How can I do this?

The data is as follows:

purchase = data.frame(user_id = c(156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086), week_number =  c(1, 5, 9, 13, 16, 21, 30, 38, 42, 46, 50, 53, 72, 76, 83, 93, 98, 103, 110, 120, 124, 128, 133, 137, 141))

contact = data.frame(user_id = c(156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086, 156086), week_number = c(99, 120, 101, 105, 119, 117, 118, 119, 117, 118, 119, 116, 115, 118, 119, 116, 118), contacted = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))

I have included only one user here, but there are about 40 thousand users. The expected result for this user would be (I left out user_id because it is the same as before):

output = data.frame(week_number =  c(1, 5, 9, 13, 16, 21, 30, 38, 42, 46, 50, 53, 72, 76, 83, 93, 98, 103, 110, 120, 124, 128, 133, 137, 141), contacted = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,  0,  0, 0, 0, 0))

My first thought was to loop over users and, for each user, create an array from 1 to 147 (the week numbers), mark the weeks in which a contact was made, apply rollsum with a lag, and then use the week numbers from the purchase table to look up whether there was a contact in the given window. But that takes quite some time.
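For a single user, the idea above could be sketched roughly as follows. This is only an illustration: the helper name contacted_in_window is hypothetical, the input vectors are a small subset of the example data, and base R's cumsum stands in for zoo's rollsum so that the snippet has no dependencies.

```r
# Hypothetical helper: flags each purchase week that has a contact in the
# n-week window ending at that week (the purchase week plus the n - 1 before).
contacted_in_window <- function(purchase_weeks, contact_weeks, n, max_week = 147) {
  hits <- numeric(max_week)             # one slot per week, 1..max_week
  hits[unique(contact_weeks)] <- 1      # mark the weeks that had a contact
  cs <- c(0, cumsum(hits))              # cs[k + 1] = number of contact weeks in 1..k
  lower <- pmax(purchase_weeks - n, 0)  # last week before the window starts
  as.numeric(cs[purchase_weeks + 1] - cs[lower + 1] > 0)
}

purchase_weeks <- c(98, 103, 110, 120, 124)   # subset of the example purchases
contact_weeks  <- c(99, 101, 105, 115, 120)   # subset of the example contacts

res <- contacted_in_window(purchase_weeks, contact_weeks, n = 4)
print(res)  # 0 1 0 1 0
```

Running this once per user is what makes the approach slow for ~40 thousand users.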

Is there a way to calculate this in a single line?

Thanks!

1 answer

You can achieve this with a rolling join from the data.table package. With:

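library(data.table)
# convert both data frames to data.tables by reference
setDT(purchase)
setDT(contact)
# rolling join: for each purchase week, look up the most recent contact
# week at most 4 weeks back (joined on week_number only, which is
# sufficient for this single-user example)
out <- contact[purchase, .(user_id = i.user_id, week_number, contacted),
               on = "week_number", roll = 4, nomatch = NA
               ][is.na(contacted), contacted := 0]  # no match means no contact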

This gives:

> out
    user_id week_number contacted
 1:  156086           1         0
 2:  156086           5         0
 3:  156086           9         0
 4:  156086          13         0
 5:  156086          16         0
 6:  156086          21         0
 7:  156086          30         0
 8:  156086          38         0
 9:  156086          42         0
10:  156086          46         0
11:  156086          50         0
12:  156086          53         0
13:  156086          72         0
14:  156086          76         0
15:  156086          83         0
16:  156086          93         0
17:  156086          98         0
18:  156086         103         1
19:  156086         110         0
20:  156086         120         1
21:  156086         124         1
22:  156086         128         0
23:  156086         133         0
24:  156086         137         0
25:  156086         141         0

Explanation:

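setDT converts the data frames to data.tables (an enhanced form of data frame). Joining purchase with contact via contact[purchase, ...] and nomatch = NA keeps every row of purchase, with NA in contacted where no contact matches. With .(user_id = i.user_id, week_number, contacted) you select the columns to return. With roll = 4, each contact is carried forward by at most 4 weeks, so a purchase matches a contact made in the same week or up to 4 weeks earlier.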
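With many users, the join presumably also needs to match on user_id, otherwise one user's contacts could be rolled onto another user's purchases. The following is a minimal sketch under a few assumptions: the second user (200001) and the toy rows are made up for illustration, roll = n - 1 is used so that the window is the purchase week plus the n - 1 weeks before it (as described in the question), and unique() is applied to contact so that duplicated contact rows do not produce duplicated purchase rows.

```r
library(data.table)

# Toy data: user 200001 is hypothetical and only added for illustration
purchase <- data.table(
  user_id     = c(156086, 156086, 156086, 200001, 200001),
  week_number = c(103, 120, 124, 103, 121)
)
contact <- data.table(
  user_id     = c(156086, 156086, 156086, 200001),
  week_number = c(101, 120, 120, 118),
  contacted   = 1
)

n <- 4  # window length in weeks, counting the purchase week itself

# Exact match on user_id, rolling match on week_number (the last 'on' column)
out <- unique(contact)[purchase, on = .(user_id, week_number), roll = n - 1]
out[is.na(contacted), contacted := 0]  # purchases without a contact in the window
print(out)
```

Changing n to 8 or 12 gives the other window lengths without touching the rest of the code.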


Source: https://habr.com/ru/post/1619626/