Vectorize cumsum in R

I am trying to create a column in a very large data frame (~2.2 million rows) that holds a running count of 1s for each factor level, resetting whenever a 0 occurs or a new factor level begins. Below are some basic data that resemble my own.

 itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
 goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
 df <- data.frame(itemcode, goodp)

I would like the cum.goodp output variable to look like this:

 cum.goodp <- c(0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 0, 1) 

I know there are many possibilities using the canonical split-apply-combine approach, which is conceptually intuitive, but I tried the following:

 k <- transform(df, cum.goodp = goodp * ave(goodp, c(0L, cumsum(diff(goodp) != 0)), FUN = seq_along, by = itemcode))

When that code is run, it is very slow. I gather that transform is part of the reason why (the by grouping doesn't help either). There are more than 70,000 distinct values for the itemcode variable, so the operation should probably be vectorized. Is there a way to vectorize this using cumsum? If not, any help would be truly appreciated. Thank you very much.

+5
2 answers

With the modified input/output example, you can use the following base R approach (among others):

 transform(df, cum.goodpX = ave(goodp, itemcode, cumsum(goodp == 0), FUN = cumsum))
 #   itemcode goodp cum.goodp cum.goodpX
 #1        a1     0         0          0
 #2        a1     1         1          1
 #3        a1     1         2          2
 #4        a1     0         0          0
 #5        a1     1         1          1
 #6        a2     1         1          1
 #7        a2     1         2          2
 #8        a3     0         0          0
 #9        a4     0         0          0
 #10       a4     1         1          1
 #11       a5     1         1          1
 #12       a6     1         1          1
 #13       a6     1         2          2
 #14       a6     0         0          0
 #15       a6     1         1          1

Note: I added the expected cum.goodp column to the input df and named the computed column cum.goodpX so you can easily compare the two.
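
To see why the grouping works, note that cumsum(goodp == 0) increments at every 0, so within each itemcode every stretch starting at a 0 falls into its own group (an illustrative check, not part of the original answer):

 cumsum(goodp == 0)
 # [1] 1 1 1 2 2 2 2 3 4 4 4 4 4 5 5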

But of course you can use many other approaches with packages, either as suggested by @MartinMorgan or, for example, using dplyr or data.table, to name just two options. These can be much faster than base R approaches for large data sets.

Here's how to do it in dplyr:

 library(dplyr)
 df %>%
   group_by(itemcode, grp = cumsum(goodp == 0)) %>%
   mutate(cum.goodpX = cumsum(goodp))

A data.table option was already suggested in the comments to your question.
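
That comment isn't reproduced here, but a data.table version along the same lines might look like this (a sketch using the same reset-group idea; the grp name is my own):

 library(data.table)
 dt <- as.data.table(df)
 # group by itemcode plus a counter that increments at every 0
 dt[, cum.goodpX := cumsum(goodp), by = .(itemcode, grp = cumsum(goodp == 0))]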

+3

A base R approach is to calculate cumsum over the whole vector, and to capture the geometry of the sub-lists using run-length encoding. Figure out the start of each group, and create new groups:

 start <- c(TRUE, itemcode[-1] != itemcode[-length(itemcode)]) | !goodp
 f <- cumsum(start)

Summarize these as a run-length encoding, and calculate the overall cumulative sum:

 r <- rle(f)
 x <- cumsum(goodp)

Then use the geometry to get the offset that each embedded sum needs to be corrected by,

 offset <- c(0, x[cumsum(r$lengths)]) 

and calculate the updated value

 x - rep(offset[-length(offset)], r$lengths) 
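
To make the offset correction concrete, here are the intermediate values for the sample data (a trace added here for illustration):

 start
 #  [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
 f
 #  [1] 1 1 1 2 2 3 3 4 5 5 6 7 7 8 8
 x
 #  [1]  0  1  2  2  3  4  5  5  5  6  7  8  9  9 10
 offset
 # [1]  0  2  3  5  5  6  7  9 10

Each group's corrected values are its slice of x minus the running total accumulated before the group started.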

Here is the function

 cumsumByGroup <- function(x, f) {
     ## a new group starts at each change in f and at each zero in x
     start <- c(TRUE, f[-1] != f[-length(f)]) | !x
     r <- rle(cumsum(start))
     ## cumulative sum over the whole vector, corrected by each
     ## group's starting offset
     x <- cumsum(x)
     offset <- c(0, x[cumsum(r$lengths)])
     x - rep(offset[-length(offset)], r$lengths)
 }

Here is the result applied to the sample data.

 > cumsumByGroup(goodp, itemcode)
  [1] 0 1 2 0 1 1 2 0 0 1 1 1 2 0 1

and performance

 > n <- 1 + rpois(1000000, 1)
 > goodp <- sample(c(0, 1), sum(n), TRUE)
 > itemcode <- rep(seq_along(n), n)
 > system.time(cumsumByGroup(goodp, itemcode))
    user  system elapsed
    0.55    0.00    0.55

The dplyr solution takes about 70 seconds.

@Alexis_laz's solution is elegant, and twice as fast as mine:

 cumsumByGroup1 <- function(x, f) {
     start <- c(TRUE, f[-1] != f[-length(f)]) | !x
     cs <- cumsum(x)
     cs - cummax((cs - x) * start)
 }
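
The trick: cs - x is the running total just before each element, so (cs - x) * start records the baseline at each reset point, and cummax carries the most recent baseline forward (this relies on goodp being non-negative, so cs never decreases). A quick consistency check on the sample data:

 stopifnot(all.equal(cumsumByGroup(goodp, itemcode),
                     cumsumByGroup1(goodp, itemcode)))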
+11

Source: https://habr.com/ru/post/1244754/

