Vectorize cumsum in R

I am trying to create a column in a very large data frame (~2.2 million rows) that holds a running count of 1s for each factor level, resetting whenever a 0 occurs or a new factor level begins. Below are some basic data that resemble my own.

 itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
 goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
 df <- data.frame(itemcode, goodp)

I would like the cum.goodp output variable to look like this:

 cum.goodp <- c(0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 0, 1) 

I know there are many possibilities using the canonical split-apply-combine approach, which is conceptually intuitive, but I tried the following:

 k <- transform(df, cum.goodp = goodp * ave(goodp, c(0L, cumsum(diff(goodp) != 0)), FUN = seq_along, by = itemcode))

When that code is run, it is very slow. I gather that transform is part of the reason why (the by grouping doesn't help either). There are more than 70,000 distinct values for the itemcode variable, so the operation should probably be vectorized. Is there a way to vectorize this using cumsum? If not, any help would be truly appreciated. Thank you very much.

+5
2 answers

With the modified input/output example, you can use the following base R approach (among others):

 transform(df, cum.goodpX = ave(goodp, itemcode, cumsum(goodp == 0), FUN = cumsum))
 #   itemcode goodp cum.goodp cum.goodpX
 #1        a1     0         0          0
 #2        a1     1         1          1
 #3        a1     1         2          2
 #4        a1     0         0          0
 #5        a1     1         1          1
 #6        a2     1         1          1
 #7        a2     1         2          2
 #8        a3     0         0          0
 #9        a4     0         0          0
 #10       a4     1         1          1
 #11       a5     1         1          1
 #12       a6     1         1          1
 #13       a6     1         2          2
 #14       a6     0         0          0
 #15       a6     1         1          1

Note: I added the expected cum.goodp column to the input df and named the computed column cum.goodpX so you can easily compare the two.
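
To see why the grouping works, note that cumsum(goodp == 0) increments at every 0, so within each itemcode every stretch starting at a 0 falls into its own group (an illustrative check, not part of the original answer):

 cumsum(goodp == 0)
 # [1] 1 1 1 2 2 2 2 3 4 4 4 4 4 5 5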

But of course you can use many other approaches with packages, either as suggested by @MartinMorgan or, for example, using dplyr or data.table, to name just two options. These can be much faster than base R approaches for large data sets.

Here's how to do it in dplyr:

 library(dplyr)
 df %>%
   group_by(itemcode, grp = cumsum(goodp == 0)) %>%
   mutate(cum.goodpX = cumsum(goodp))

A data.table option was already suggested in the comments to your question.
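
That comment isn't reproduced here, but a data.table version along the same lines might look like this (a sketch using the same reset-group idea; the grp name is my own):

 library(data.table)
 dt <- as.data.table(df)
 # group by itemcode plus a counter that increments at every 0
 dt[, cum.goodpX := cumsum(goodp), by = .(itemcode, grp = cumsum(goodp == 0))]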

+3

A base R approach is to calculate cumsum over the whole vector, and to capture the geometry of the sub-lists using run-length encoding. Figure out the start of each group, and create new groups:

 start <- c(TRUE, itemcode[-1] != itemcode[-length(itemcode)]) | !goodp
 f <- cumsum(start)

Summarize these as a run-length encoding, and calculate the overall cumulative sum:

 r <- rle(f)
 x <- cumsum(goodp)

Then use the geometry to get the offset that each embedded sum needs to be corrected by,

 offset <- c(0, x[cumsum(r$lengths)]) 

and calculate the updated value

 x - rep(offset[-length(offset)], r$lengths) 
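
To make the offset correction concrete, here are the intermediate values for the sample data (a trace added here for illustration):

 start
 #  [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
 f
 #  [1] 1 1 1 2 2 3 3 4 5 5 6 7 7 8 8
 x
 #  [1]  0  1  2  2  3  4  5  5  5  6  7  8  9  9 10
 offset
 # [1]  0  2  3  5  5  6  7  9 10

Each group's corrected values are its slice of x minus the running total accumulated before the group started.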

Here is the function

 cumsumByGroup <- function(x, f) {
     ## a new group starts at each change in f and at each zero in x
     start <- c(TRUE, f[-1] != f[-length(f)]) | !x
     r <- rle(cumsum(start))
     ## cumulative sum over the whole vector, corrected by each
     ## group's starting offset
     x <- cumsum(x)
     offset <- c(0, x[cumsum(r$lengths)])
     x - rep(offset[-length(offset)], r$lengths)
 }

Here is the result applied to the sample data.

 > cumsumByGroup(goodp, itemcode)
  [1] 0 1 2 0 1 1 2 0 0 1 1 1 2 0 1

and performance

 > n <- 1 + rpois(1000000, 1)
 > goodp <- sample(c(0, 1), sum(n), TRUE)
 > itemcode <- rep(seq_along(n), n)
 > system.time(cumsumByGroup(goodp, itemcode))
    user  system elapsed
    0.55    0.00    0.55

The dplyr solution takes about 70 seconds.

@Alexis_laz's solution is elegant, and twice as fast as mine:

 cumsumByGroup1 <- function(x, f) {
     start <- c(TRUE, f[-1] != f[-length(f)]) | !x
     cs <- cumsum(x)
     cs - cummax((cs - x) * start)
 }
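
The trick: cs - x is the running total just before each element, so (cs - x) * start records the baseline at each reset point, and cummax carries the most recent baseline forward (this relies on goodp being non-negative, so cs never decreases). A quick consistency check on the sample data:

 stopifnot(all.equal(cumsumByGroup(goodp, itemcode),
                     cumsumByGroup1(goodp, itemcode)))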
+11

Source: https://habr.com/ru/post/1244754/

