Fast matrix subsetting in R

I have the following problem: I need many subsets of a large matrix. Actually, I only need views to pass as input to another function f(), so I don't need to change the values. However, it seems that R is too slow for this task, or I'm doing something wrong (which seems more likely). The toy example below shows how long it takes just to select the columns and then to use them in another function (here, the primitive sum() function). As a benchmark, I also time the summation of the entire matrix, which is surprisingly faster. I also experimented with the ref package, but could not achieve any performance gain. So the key question is: how can I subset the matrix without copying it? I appreciate any help, thanks!

    library(microbenchmark)
    library(ref)

    m0 <- matrix(rnorm(10^6), 10^3, 10^3)
    r0 <- refdata(m0)

    microbenchmark(m0[, 1:900], sum(m0[, 1:900]), sum(r0[,1:900]), sum(m0))

    Unit: milliseconds
                 expr       min        lq      mean    median        uq       max neval
          m0[, 1:900] 10.087403 12.350751 16.697078 18.307475 19.054157 58.238217   100
     sum(m0[, 1:900]) 11.067583 13.341860 17.286514 19.123748 19.990661 25.664729   100
     sum(r0[, 1:900]) 11.066164 13.194244 16.869551 19.204434 20.004034 23.505308   100
              sum(m0)  1.015247  1.040574  1.059872  1.049513  1.067142  1.233617   100

The benchmark task of summing the entire matrix takes 1.059872 milliseconds on average and is about 16 times faster than the other expressions.

+5
2 answers

The problem with your approach is that subsetting allocates a new matrix and copies the selected values into it, and that takes time.
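To make the allocation concrete, here is a minimal sketch that compares the memory footprint of a column subset with that of the original matrix. The matrix is much smaller than the one in the question, and the names m, sub, size_full and size_sub are only illustrative.

```r
# Illustrative sketch: m, sub, size_full and size_sub are hypothetical names.
set.seed(1)
m <- matrix(rnorm(200 * 100), nrow = 200, ncol = 100)

# Selecting columns produces a separate matrix object holding its own values.
sub <- m[, 1:90]

# Compare the memory reported for the original and for the subset.
size_full <- as.numeric(object.size(m))
size_sub  <- as.numeric(object.size(sub))

# The subset holds 90 of the 100 columns, so the ratio is close to 0.9.
print(size_sub / size_full)
```

The subset is a full object in its own right, so every call of the form m0[, cols] pays for building one before f() even starts.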

You have two solutions:

If the time spent by sum on the whole matrix is acceptable to you, you can use colSums on the whole matrix and subset the result:

    sum(colSums(m0)[1:900])
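Since the question mentions many subsets, the column totals can be computed once and reused. The following is a small sketch of that idea under the assumption that only sums are needed; the names m, col_totals, sum_cols and subsets are illustrative and not from the original post.

```r
# Sketch with illustrative names (m, col_totals, sum_cols, subsets are assumptions).
set.seed(42)
m <- matrix(rnorm(200 * 100), nrow = 200, ncol = 100)

# Compute every column total once; this reads the matrix without subsetting it.
col_totals <- colSums(m)

# Sum over any set of columns by indexing the short vector of totals.
sum_cols <- function(totals, cols) sum(totals[cols])

subsets <- list(1:90, c(2, 5, 7), 50:100)
fast <- vapply(subsets, function(cols) sum_cols(col_totals, cols), numeric(1))
slow <- vapply(subsets, function(cols) sum(m[, cols]), numeric(1))

# Check that both approaches agree up to floating-point rounding.
print(all.equal(fast, slow))
```

This only helps when f() can be assembled from per-column results, as sum can; for a general f() it does not remove the copy.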

Or you can use Rcpp to compute the sum over a subset of columns without copying the matrix.

    #include <Rcpp.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    double sumSub(const NumericMatrix& x, const IntegerVector& colInd) {
      double sum = 0;
      for (IntegerVector::const_iterator it = colInd.begin(); it != colInd.end(); ++it) {
        int j = *it - 1;
        for (int i = 0; i < x.nrow(); i++) {
          sum += x(i, j);
        }
      }
      return sum;
    }

    microbenchmark(m0[, 1:900], sum(m0[, 1:900]), sum(r0[,1:900]), sum(m0),
                   sum(colSums(m0)[1:900]), sumSub(m0, 1:900))

    Unit: milliseconds
                        expr      min       lq     mean   median       uq       max neval
                 m0[, 1:900] 4.831616 5.447749 5.641096 5.675774 5.861052  6.418266   100
            sum(m0[, 1:900]) 6.103985 6.475921 7.052001 6.723035 6.999226 37.085345   100
            sum(r0[, 1:900]) 6.224850 6.449210 6.728681 6.705366 6.943689  7.565842   100
                     sum(m0) 1.110073 1.145906 1.175224 1.168696 1.197889  1.269589   100
     sum(colSums(m0)[1:900]) 1.113834 1.141411 1.178913 1.168312 1.201827  1.408785   100
           sumSub(m0, 1:900) 1.337188 1.368383 1.404744 1.390846 1.415434  2.459361   100

You can use loop unrolling to further optimize the Rcpp version.
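One way to read the optimization mentioned above is manual loop unrolling of the inner loop. The sketch below is an assumption about what that could look like, not the answerer's code: sumSubUnrolled is a hypothetical name, the unroll factor of four is arbitrary, and the function is compiled from R with Rcpp::cppFunction. Whether it is actually faster depends on the compiler and its flags, so it should be benchmarked before use.

```r
library(Rcpp)

# Hypothetical variant of sumSub (the name sumSubUnrolled is an assumption).
# The inner loop handles four rows per iteration with separate accumulators,
# then a short tail loop picks up any remaining rows.
cppFunction('
double sumSubUnrolled(const NumericMatrix& x, const IntegerVector& colInd) {
  const int n = x.nrow();
  double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for (int k = 0; k < colInd.size(); ++k) {
    const int j = colInd[k] - 1;   // R indices are 1-based
    int i = 0;
    for (; i + 3 < n; i += 4) {    // unrolled by a factor of four
      s0 += x(i,     j);
      s1 += x(i + 1, j);
      s2 += x(i + 2, j);
      s3 += x(i + 3, j);
    }
    for (; i < n; ++i) {           // leftover rows when n is not a multiple of 4
      s0 += x(i, j);
    }
  }
  return (s0 + s1) + (s2 + s3);
}')

# Small check on a matrix whose row count is not a multiple of four.
set.seed(7)
m <- matrix(rnorm(203 * 50), nrow = 203, ncol = 50)
print(all.equal(sumSubUnrolled(m, 1:45), sum(m[, 1:45])))
```

Because the four accumulators add the values in a different order than sum() does, the results can differ in the last few bits, which is why the check uses all.equal rather than identical.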

+4

Using the compiler package, I wrote a function that gets the result about twice as fast as your other methods (about 8x the time of sum(m0) instead of 16x):

    require(compiler)

    compiler_sum <- cmpfun({function(x) {
      tmp <- 0
      for (i in 1:900)
        tmp <- tmp+sum(x[,i])
      tmp
    }})

    microbenchmark(
      sum(m0),
      compiler_sum(m0)
    )

    Unit: milliseconds
                 expr      min       lq     mean   median      uq       max neval
              sum(m0) 1.016532 1.056030 1.107263 1.084503 1.11173  1.634391   100
     compiler_sum(m0) 7.655251 7.854135 8.000521 8.021107 8.29850 16.760058   100

0

Source: https://habr.com/ru/post/1271457/

