I am trying to create a column in a very large data frame (~2.2 million rows) that holds a cumulative count of consecutive 1s, resetting to 0 at each 0 and whenever a new factor level begins. Below are some basic data that resemble my own.
itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
df <- data.frame(itemcode, goodp)
I would like the cum.goodp output variable to look like this:
cum.goodp <- c(0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 0, 1)
I understand that there are many ways to do this with the canonical split-apply-combine approach, which is conceptually intuitive, but I tried the following:
k <- transform(df, cum.goodp = goodp * ave(goodp, c(0L, cumsum(diff(goodp) != 0)), FUN = seq_along, by = itemcode))
When I try to run this code, it is very slow. I suspect the conversion is part of the reason why (and the "by" doesn't help). There are more than 70K distinct values for the itemcode variable, so it should probably be vectorized. Is there a way to vectorize this using cumsum? If not, any help would be truly appreciated. Thank you very much.
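To make concrete what I mean by "vectorized using cumsum", here is a sketch of the kind of approach I am imagining, shown on the toy data above. The run-boundary logic (a counter must restart at every 0 and at every itemcode change) is my own attempt and has not been tested at scale:

```r
itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
df <- data.frame(itemcode, goodp)

# TRUE wherever the count must reset: at a 0, or at the start of a new itemcode
reset <- df$goodp == 0 | c(TRUE, df$itemcode[-1] != df$itemcode[-nrow(df)])

# Running total of goodp; cs - goodp is the total strictly before each row
cs <- cumsum(df$goodp)

# Baseline = running total just before the most recent reset point;
# subtracting it restarts the count from 0 at every reset
baseline <- cummax((cs - df$goodp) * reset)
df$cum.goodp <- cs - baseline
```

This uses only `cumsum` and `cummax`, so it is a single vectorized pass with no per-group apply, which is the behaviour I am hoping for on the full 2.2-million-row data.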