R, dplyr: cumulative version of n_distinct

Question


I have a dataframe as follows. It is ordered by the time column.

Input -

    df = data.frame(time = 1:20,
                    grp = sort(rep(1:5,4)),
                    var1 = rep(c('A','B'),10)
                    )

    head(df,10)
       time grp var1
    1     1   1    A
    2     2   1    B
    3     3   1    A
    4     4   1    B
    5     5   2    A
    6     6   2    B
    7     7   2    A
    8     8   2    B
    9     9   3    A
    10   10   3    B

I want to create another variable, var2, that counts the number of distinct var1 values seen so far, up to that point in time, within each group grp. This is slightly different from what I would get if I used n_distinct, which returns a single count for the whole group.

Expected Result -

       time grp var1 var2
    1     1   1    A    1
    2     2   1    B    2
    3     3   1    A    2
    4     4   1    B    2
    5     5   2    A    1
    6     6   2    B    2
    7     7   2    A    2
    8     8   2    B    2
    9     9   3    A    1
    10   10   3    B    2

I want to create a function for this, say cum_n_distinct, and use it like this -

    d_out = df %>%
      arrange(time) %>%
      group_by(grp) %>%
      mutate(var2 = cum_n_distinct(var1))

+6

r dplyr cumsum

steadyfish Aug 28 '14 at 15:55


3 answers

A `dplyr` solution based on @akrun's answer -

The logic below sets the first occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then applies cumsum to that flag -

    df = df %>%
      arrange(time) %>%
      group_by(grp,var1) %>%
      mutate(var_temp = ifelse(row_number()==1,1,0)) %>%
      group_by(grp) %>%
      mutate(var2 = cumsum(var_temp)) %>%
      select(-var_temp)

    head(df,10)

    Source: local data frame [10 x 4]
    Groups: grp

       time grp var1 var2
    1     1   1    A    1
    2     2   1    B    2
    3     3   1    A    2
    4     4   1    B    2
    5     5   2    A    1
    6     6   2    B    2
    7     7   2    A    2
    8     8   2    B    2
    9     9   3    A    1
    10   10   3    B    2
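The first-occurrence-then-cumsum idea can also be packed into a small helper. This is only a minimal sketch, assuming base R's duplicated() is acceptable here; cum_n_distinct is the hypothetical name from the question, not an existing dplyr function.

```r
# Hypothetical helper (name borrowed from the question; not part of dplyr).
# duplicated(x) is FALSE at the first occurrence of each value, so
# !duplicated(x) flags first occurrences and cumsum() counts them so far.
cum_n_distinct <- function(x) cumsum(!duplicated(x))

# One group's worth of var1 values from the example data
cum_n_distinct(c("A", "B", "A", "B"))
# 1 2 2 2
```

Used inside mutate() after group_by(grp), a helper like this would be evaluated once per group, which should give the same var2 column as the pipeline above.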

+4

steadyfish Aug 28 '14 at 18:33


Try:

Update

With your new dataset, an approach in base R:

    df$var2 <- unlist(lapply(split(df, df$grp), function(x) {
      x$var2 <- 0
      indx <- match(unique(x$var1), x$var1)
      x$var2[indx] <- 1
      cumsum(x$var2)
    }))

    head(df,7)
    #   time grp var1 var2
    # 1    1   1    A    1
    # 2    2   1    B    2
    # 3    3   1    A    2
    # 4    4   1    B    2
    # 5    5   2    A    1
    # 6    6   2    B    2
    # 7    7   2    A    2

+2

akrun Aug 28 '14 at 16:30


Brodieg · Accepted Answer · 2014-08-28T18:20:39+0000

Assuming the data is already ordered by time, first define a cumulative distinct function:

 dist_cum <- function(var) sapply(seq_along(var), function(x) length(unique(head(var, x))))
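To see what this function returns, here is a small worked example on a single vector (the vector v is made up for illustration). For each position x it takes the first x elements with head(var, x) and counts the unique values among them, so the work grows roughly quadratically with the length of the vector.

```r
# Same definition as above, repeated so the example is self-contained
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))

# Illustrative input, not from the original data
v <- c("A", "B", "A", "C", "B")

# Prefixes: A | A B | A B A | A B A C | A B A C B
dist_cum(v)
# 1 2 2 3 3
```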

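Then a base solution that uses ave to create the groups and applies our function to each group. Note that ave expects a numeric vector, so var1 is converted to integer codes; going through factor() makes this work whether var1 is a factor or a character vector (since R 4.0, data.frame() no longer creates factors by default):

    transform(df, var2=ave(as.integer(factor(var1)), grp, FUN=dist_cum))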

A data.table solution, basically doing the same thing:

    library(data.table)
    (data.table(df)[, var2:=dist_cum(var1), by=grp])

And dplyr, again the same thing:

    library(dplyr)
    df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
