Calculation of simple retention in R

For a data set test, my goal is to find out how many unique users carried over from one period to the next, for each period.

> test
   user_id period
1        1      1
2        5      1
3        1      1
4        3      1
5        4      1
6        2      2
7        3      2
8        2      2
9        3      2
10       1      2
11       5      3
12       5      3
13       2      3
14       1      3
15       4      3
16       5      4
17       5      4
18       5      4
19       4      4
20       3      4

For example, in the first period there were four unique users (1, 3, 4, and 5), two of whom (1 and 3) were active in the second period. Therefore, the retention rate is 0.5. In the second period there were three unique users, two of whom were active in the third period, so the retention rate is 0.666, and so on. How can I find the share of unique users who are active in the next period? Any suggestions would be appreciated.

The desired output is as follows:

> output
  period retention
1      1        NA
2      2     0.500
3      3     0.666
4      4     0.500

Data test:

> dput(test)
structure(list(user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5, 
2, 1, 4, 5, 5, 5, 4, 3), period = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)), .Names = c("user_id", "period"
), row.names = c(NA, -20L), class = "data.frame")

One way to do this is with plain loops. Below, df is the data frame (test in the question):

# make a list to hold the unique user IDs for each period
uniques = list()
for(i in 1:max(df$period)){
  uniques[[i]] = unique(df$user_id[df$period == i])
}

# hold the retention rates
retentions = rep(NA, times = max(df$period))

for(j in 2:max(df$period)){
  retentions[j] = mean(uniques[[j-1]] %in% uniques[[j]])
}

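%in% returns a logical vector indicating whether each element of its first argument appears in the second. Taking the mean of that vector gives the proportion of users retained.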
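To see why the mean of %in% gives a retention rate, here is a small check on the first two periods. The user IDs come from the test data in the question; the names p1, p2, retained and rate are only for illustration.

```r
# Unique users in periods 1 and 2, taken from the test data in the question
p1 <- c(1, 3, 4, 5)
p2 <- c(1, 2, 3)

# TRUE where a period-1 user also appears in period 2
retained <- p1 %in% p2

# The mean of a logical vector is the share of TRUE values
rate <- mean(retained)

print(retained)
print(rate)
```

Users 1 and 3 are found in the second period and users 4 and 5 are not, which matches the 0.5 quoted in the question.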


Another option is to split the user IDs by period and compare each period with the next one using mapply.

splt <- split(test$user_id, test$period)

carryover <- function(x, y) {
    length(unique(intersect(x, y))) / length(unique(x))
}
mapply(carryover, splt[1:(length(splt) - 1)], splt[2:length(splt)])

        1         2         3 
0.5000000 0.6666667 0.5000000 
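The mapply result is a named vector, while the question asks for a data frame with NA in the first period. A minimal sketch of one way to get that shape follows. It assumes the periods are consecutive and sort in increasing order; the names users, rates and output are illustrative, not part of the answer above.

```r
# Data from the question
test <- data.frame(
  user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5, 2, 1, 4, 5, 5, 5, 4, 3),
  period  = rep(1:4, each = 5)
)

# Unique users per period, in increasing period order
users <- lapply(split(test$user_id, test$period), unique)

# Share of the previous period's users who are active in the current one
rates <- mapply(function(prev, cur) mean(prev %in% cur),
                head(users, -1), tail(users, -1))

# The first period has no predecessor, so its retention is NA
output <- data.frame(period = as.numeric(names(users)),
                     retention = c(NA, unname(rates)))
print(output)
```

Each rate is placed on the row of the later period, as in the desired output in the question.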

Using dplyr with summarise:

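library(dplyr)

test %>%
  group_by(period) %>%
  # share of this period's unique users who are also active in the next period
  summarise(retention = length(intersect(user_id, test$user_id[test$period == period[1] + 1])) /
              n_distinct(user_id)) %>%
  # shift down one row so each rate is reported against the later period
  mutate(retention = lag(retention))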

Output:

  period retention
   <dbl>     <dbl>
1      1        NA
2      2 0.5000000
3      3 0.6666667
4      4 0.5000000

Source: https://habr.com/ru/post/1017398/

