How to sum grouped elements of a vector in R

Question

Say I have this vector

v <- c(1:100)

And I want to get the following:

 b[1] = sum(v[c(1:10)])
 b[2] = sum(v[c(11:20)])
 ...

I can solve this with a loop, but I'm sure there is an "R way" that looks something like:

 b <- groupedSum(v, 10)

where b is a vector holding the sum of each group of 10. What is the R way to do this?

+2

r

jordi Feb 21 '12 at 14:14

5 answers

Step 1: Create an index for the groups:

 N <- 50
 size <- 10  # Size of a group
 v <- seq(N)
 idx <- as.factor(rep(seq(N/size), each=size))

Step 2: Use any of the vectorized tools (by, plyr, etc.) to sum over the groups:

 by(v,idx,sum)

Step 3: Profit

 idx: 1
 [1] 55
 ---------------------------------------------------------------------------------
 idx: 2
 [1] 155
 ---------------------------------------------------------------------------------
 idx: 3
 [1] 255
 ---------------------------------------------------------------------------------
 idx: 4
 [1] 355
 ---------------------------------------------------------------------------------
 idx: 5
 [1] 455
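As one illustration of the "any number of vectorized tools" in Step 2, here is a minimal sketch that reuses the same idx factor with split() and vapply() from base R. This is just one possible alternative, not something the answer itself prescribes; the name sums is made up for the example.

```r
# Same setup as in Step 1
N <- 50
size <- 10
v <- seq(N)
idx <- as.factor(rep(seq(N / size), each = size))

# split() breaks v into a list with one element per level of idx;
# vapply() then sums each piece and returns a plain named vector
sums <- vapply(split(v, idx), sum, numeric(1))
print(sums)
```

Unlike by(), this returns an ordinary named numeric vector, which may be easier to pass on to other code.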

+3

Ari B. Friedman Feb 21 '12 at 14:20

There are already two good methods. I suggest using cut(), which labels each sum in the output with the range it covers:

 v <- c(1:100)
 dat <- data.frame(v=v, cat = cut(v, seq(0, 100, by=10)))
 aggregate(v~cat, data=dat, sum)

Yielding:

         cat   v
 1    (0,10]  55
 2   (10,20] 155
 3   (20,30] 255
 4   (30,40] 355
 5   (40,50] 455
 6   (50,60] 555
 7   (60,70] 655
 8   (70,80] 755
 9   (80,90] 855
 10 (90,100] 955
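One thing to keep in mind: the code above cuts on the values of v, which happens to match the positions because v is 1:100. For a vector with arbitrary values, a possible variation (a sketch, not part of the original answer) is to cut on the positions instead. The sample data and the names pos and res are assumptions made for this example.

```r
# Hypothetical data: 100 arbitrary values, so value and position differ
set.seed(1)
v <- sample(1000, 100)

# Cut on the element positions rather than on the values themselves
pos <- seq_along(v)
dat <- data.frame(v = v, cat = cut(pos, seq(0, length(v), by = 10)))

# One row per block of 10 consecutive elements
res <- aggregate(v ~ cat, data = dat, sum)
print(res)
```

The labels in the cat column then describe index ranges, such as (0,10] for elements 1 to 10.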

+2

Tyler rinker Feb 21 '12 at 14:59

A faster method for large data sets (20-300 times faster than the methods above) is to reshape the vector into a matrix and then use colSums.

 > colSums( matrix( v, nrow = 10, ncol = 10 ))
 [1]  55 155 255 355 455 555 655 755 855 955
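The matrix trick assumes that length(v) is an exact multiple of the group size. Below is a minimal sketch of the groupedSum(v, 10) helper the question asks for, built on the same colSums idea. Padding the last group with zeros is my own assumption about how a short final group should be handled; the function name comes from the question and is not an existing R function.

```r
# Hypothetical helper: sum consecutive groups of `size` elements.
# Assumption: a short final group is padded with zeros, which leaves its sum unchanged.
groupedSum <- function(v, size) {
  n_groups <- ceiling(length(v) / size)
  padded <- c(v, rep(0, n_groups * size - length(v)))
  # matrix() fills column by column, so each column holds one group
  colSums(matrix(padded, nrow = size, ncol = n_groups))
}

b <- groupedSum(1:100, 10)
print(b)

# 25 elements: two full groups and one partial group (21:25)
print(groupedSum(1:25, 10))
```

Because matrix() fills column-wise, column j holds elements (j-1)*size + 1 through j*size, which is exactly the grouping in the question.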

Consider a larger dataset:

 > n_per_group = 1e3
 > n_groups = 1e3
 > v = 1:(n_per_group * n_groups)

The matrix method takes about 5 ms:

 > start = Sys.time()
 > r1 = colSums( matrix( v, nrow = n_per_group, ncol = n_groups ))
 > end = Sys.time()
 > end-start
 Time difference of 0.005604982 secs

The tapply method takes 601 ms:

 > start = Sys.time()
 > r2 = as.numeric( tapply( v, (seq_along( v ) - 1) %/% n_per_group, sum ) )
 > end = Sys.time()
 > end-start
 Time difference of 0.6015229 secs
 > all.equal( r1, r2)
 [1] TRUE

The by method takes 103 ms:

 > start = Sys.time()
 > idx = as.factor( rep( seq( n_groups ), each = n_per_group ) )
 > r3 = as.numeric(by(v,idx,sum))
 > end = Sys.time()
 > end-start
 Time difference of 0.1034958 secs
 > all.equal( r1, r3)
 [1] TRUE

The data frame method (cut plus aggregate) takes 1675 ms:

 > start = Sys.time()
 > dat <- data.frame(v=v, cat = cut(v, seq(0, n_per_group * n_groups, by= n_per_group )))
 > r4 = aggregate(v~cat, data=dat, sum)$v
 > end = Sys.time()
 > end-start
 Time difference of 1.675465 secs
 > all.equal( r1, r4)
 [1] TRUE

And the sparse matrix method takes 334 ms:

 > library( Matrix )
 > start = Sys.time()
 > f = gl( n_groups, n_per_group )
 > r5 = as( f, "sparseMatrix" ) %*% v
 > r5 = as.numeric( r5[ , 1 ] )
 > end = Sys.time()
 > end-start
 Time difference of 0.334847 secs
 > all.equal( r1, r5)
 [1] TRUE
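A single Sys.time() difference can be noisy, so if you want to re-run the comparison yourself, one option is base R's system.time() around several repetitions. This is only a sketch: the repetition count of 10 is an arbitrary choice of mine, and the numbers you get will depend on your machine and R version, so they will not match the figures above.

```r
n_per_group <- 1e3
n_groups <- 1e3
v <- 1:(n_per_group * n_groups)

# Repeat each method a few times so that very short timings become measurable
reps <- 10

t_matrix <- system.time(
  for (i in seq_len(reps)) {
    r1 <- colSums(matrix(v, nrow = n_per_group, ncol = n_groups))
  }
)

t_tapply <- system.time(
  for (i in seq_len(reps)) {
    r2 <- as.numeric(tapply(v, (seq_along(v) - 1) %/% n_per_group, sum))
  }
)

# "elapsed" is the wall-clock time in seconds for all repetitions
print(t_matrix["elapsed"] / reps)
print(t_tapply["elapsed"] / reps)

# Both methods should agree on the result
print(all.equal(r1, r2))
```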

+1

Rob Apr 28 '17 at 15:49

This solution requires the Matrix package.

 v <- seq(100)    # example data
 f <- gl(10, 10)  # generate factor for grouping
 v_sums <- as(f, "sparseMatrix") %*% v
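To make it clearer what happens here, the sketch below inspects the intermediate indicator matrix and pulls the result out as a plain vector. It assumes the Matrix package is installed; the names ind and sums are made up for the example, and the [ , 1 ] extraction follows the benchmark in another answer on this page.

```r
library(Matrix)

v <- seq(100)    # example data
f <- gl(10, 10)  # factor with 10 levels, each repeated 10 times

# One row per group, one column per element of v;
# an entry is 1 where the element belongs to that group
ind <- as(f, "sparseMatrix")
print(dim(ind))

# Multiplying by v adds up the elements of each group;
# the result is a one-column Matrix object, so take its first column
v_sums <- ind %*% v
sums <- as.numeric(v_sums[, 1])
print(sums)
```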

0

Wojciech sobala May 26 '12 at 6:07

42- · Accepted Answer · 2012-02-21T14:30:58+0000

 > tapply( v, (seq_along(v)-1) %/% 10, sum)
   0   1   2   3   4   5   6   7   8   9 
  55 155 255 355 455 555 655 755 855 955 

If there are NA values, you may need to add na.rm = TRUE to the argument list after sum.
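A small sketch of the na.rm case, using made-up missing values at positions 3 and 57:

```r
v <- c(1:100)
v[c(3, 57)] <- NA  # hypothetical missing values for illustration

grp <- (seq_along(v) - 1) %/% 10

# Without na.rm, any group containing an NA sums to NA
with_na <- tapply(v, grp, sum)
print(with_na)

# Extra arguments after the function are passed on to sum()
without_na <- tapply(v, grp, sum, na.rm = TRUE)
print(without_na)
```

Here groups 0 and 5 contain the missing values, so they are NA in the first result and 52 and 498 in the second.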

Comments: I think Tyler's approach is nicer because it provides better documentation. He suffers from having to work with the vagaries of the cut() function, which I have always felt has the wrong defaults. To create a group that captures all of 1:100, he needs to use the vector element 101. But that is not Tyler's fault. Send him the upvotes; his answer is better.

If gsk can work with by objects without running into difficulties with their class, he is better at it than I am. The result looks like a list, but it is really something else. Using his example:

 > is.list(by(v,idx,sum))
 [1] FALSE
 > is.matrix(by(v,idx,sum))
 [1] FALSE
 > is.vector(by(v,idx,sum))
 [1] FALSE

I think by objects are like named vectors and somewhat like matrices, but the fact that they do not inherit from the matrix class has always confused me.
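If the class gets in the way, one way around it (the same as.numeric() conversion used in the benchmark answer on this page) is to strip the by object down to a plain vector. A minimal sketch, reusing gsk's example; the names b and sums are made up here:

```r
N <- 50
size <- 10
v <- seq(N)
idx <- as.factor(rep(seq(N / size), each = size))

b <- by(v, idx, sum)
print(class(b))  # the object carries its own class, "by"

# as.numeric() drops the class and other attributes;
# setNames() puts the group labels back on
sums <- setNames(as.numeric(b), levels(idx))
print(sums)
print(is.vector(sums))
```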
