R, dplyr: cumulative version of n_distinct

I have a dataframe as follows. It is ordered by the time column.

Input -

    df = data.frame(time = 1:20, grp = sort(rep(1:5,4)), var1 = rep(c('A','B'),10) )

    head(df,10)
       time grp var1
    1     1   1    A
    2     2   1    B
    3     3   1    A
    4     4   1    B
    5     5   2    A
    6     6   2    B
    7     7   2    A
    8     8   2    B
    9     9   3    A
    10   10   3    B

I want to create another variable, var2, that computes the number of distinct var1 values seen so far, up to that point in time, for each group grp. This is slightly different from what I would get if I used n_distinct.

Expected Result -

       time grp var1 var2
    1     1   1    A    1
    2     2   1    B    2
    3     3   1    A    2
    4     4   1    B    2
    5     5   2    A    1
    6     6   2    B    2
    7     7   2    A    2
    8     8   2    B    2
    9     9   3    A    1
    10   10   3    B    2

I want to create a function for this, say cum_n_distinct, and use it like -

    d_out = df %>%
      arrange(time) %>%
      group_by(grp) %>%
      mutate(var2 = cum_n_distinct(var1))
3 answers

Assuming the data is already ordered by time, first define a cumulative distinct function:

 dist_cum <- function(var) sapply(seq_along(var), function(x) length(unique(head(var, x)))) 
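
As a quick check of what dist_cum returns, here is a minimal sketch on a small made-up vector (the vector x is only an illustration, not part of the question's data). For each position it counts the distinct values among the elements seen so far:

```r
# Same definition as above: for each position i, count the distinct
# values among the first i elements.
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))

# Hypothetical example vector, not taken from the question
x <- c("A", "B", "A", "C", "B")
res <- dist_cum(x)
print(res)  # 1 2 2 3 3
```

Because unique() is recomputed on every prefix, the cost grows roughly quadratically with the group size, which should be fine for small groups like the ones here.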

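Then a base solution that uses ave to create the groups and then applies our function to each group. Note that this assumes var1 is a factor; wrapping it in factor() also covers a character column, which is what data.frame creates by default since R 4.0:

    transform(df, var2=ave(as.integer(factor(var1)), grp, FUN=dist_cum))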

A data.table alternative, basically doing the same thing:

    library(data.table)
    (data.table(df)[, var2:=dist_cum(var1), by=grp])

And dplyr, again the same thing:

    library(dplyr)
    df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))

A dplyr solution based on @akrun's answer -

The logic is basically to set the first occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then apply cumsum to it -

    df = df %>%
      arrange(time) %>%
      group_by(grp,var1) %>%
      mutate(var_temp = ifelse(row_number()==1,1,0)) %>%
      group_by(grp) %>%
      mutate(var2 = cumsum(var_temp)) %>%
      select(-var_temp)

    head(df,10)

    Source: local data frame [10 x 4]
    Groups: grp

       time grp var1 var2
    1     1   1    A    1
    2     2   1    B    2
    3     3   1    A    2
    4     4   1    B    2
    5     5   2    A    1
    6     6   2    B    2
    7     7   2    A    2
    8     8   2    B    2
    9     9   3    A    1
    10   10   3    B    2
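
The first-occurrence logic can also be written without a temporary column. A minimal sketch, assuming the rows are already ordered by time: base R's duplicated() flags repeated values, so negating it marks first occurrences, and cumsum turns those flags into a running distinct count. The name cum_n_distinct is the one proposed in the question, not an existing dplyr function, and the ave() call is just one way to apply it per group without extra packages.

```r
# Running count of distinct values: !duplicated(x) is TRUE at the first
# occurrence of each value, and cumsum() accumulates those flags.
cum_n_distinct <- function(x) cumsum(!duplicated(x))

# Data from the question
df <- data.frame(time = 1:20,
                 grp  = sort(rep(1:5, 4)),
                 var1 = rep(c("A", "B"), 10))

# Apply per group with base R's ave(); seq_along() only supplies numeric
# row indices, which each group uses to look up its own var1 values.
df$var2 <- ave(seq_along(df$var1), df$grp,
               FUN = function(i) cum_n_distinct(df$var1[i]))
head(df, 6)
```

With dplyr, the same helper should drop into the pipeline from the question: group_by(grp) followed by mutate(var2 = cum_n_distinct(var1)).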

Try:

Update

With your new dataset, an approach in base R:

    df$var2 <- unlist(lapply(split(df, df$grp), function(x) {
        x$var2 <- 0
        indx <- match(unique(x$var1), x$var1)
        x$var2[indx] <- 1
        cumsum(x$var2)
    }))

    head(df,7)
    #   time grp var1 var2
    # 1    1   1    A    1
    # 2    2   1    B    2
    # 3    3   1    A    2
    # 4    4   1    B    2
    # 5    5   2    A    1
    # 6    6   2    B    2
    # 7    7   2    A    2

Source: https://habr.com/ru/post/974501/

