What is the fastest way to apply t.test to each column of a large matrix?

Suppose I have a large matrix:

M <- matrix(rnorm(1e7),nrow=20) 

Next, suppose that each column represents a sample, and that I would like to apply t.test() to each column. Is there a way to do this much faster than using apply()?

 apply(M, 2, t.test) 

It took a little less than 2 minutes to run the analysis on my computer:

    > system.time(invisible(apply(M, 2, t.test)))
       user  system elapsed
    113.513   0.663 113.519
2 answers

If you have a multi-core machine, there is something to be gained from using all the cores, for example with mclapply (from the multicore package, whose functionality is now part of base R's parallel package).

    > library(multicore)
    > M <- matrix(rnorm(40), nrow=20)
    > x1 <- apply(M, 2, t.test)
    > x2 <- mclapply(1:dim(M)[2], function(i) t.test(M[,i]))
    > all.equal(x1, x2)
    [1] "Component 1: Component 9: 1 string mismatch" "Component 2: Component 9: 1 string mismatch"
    # str(x1) and str(x2) show that the difference is immaterial

This mini-example shows that everything works as intended. Now scale up:

    > M <- matrix(rnorm(1e7), nrow=20)
    > system.time(invisible(apply(M, 2, t.test)))
       user  system elapsed
    101.346   0.626 101.859
    > system.time(invisible(mclapply(1:dim(M)[2], function(i) t.test(M[,i]))))
       user  system elapsed
     55.049   2.527  43.668

This was run on 8 virtual cores; your mileage may vary. Not a huge gain, but it comes at very little effort.
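The answer above uses the old multicore package; its mclapply() now lives in base R's parallel package. A minimal sketch of the same approach with parallel (the matrix size here is reduced for illustration; note that forking is unavailable on Windows, where mclapply only permits mc.cores = 1):

```r
# Sketch using base R's 'parallel' package, which absorbed multicore's
# mclapply() as of R 2.14. mc.cores controls the number of forked workers;
# forking is unavailable on Windows, so we pin mc.cores to 1 there.
library(parallel)

n.cores <- if (.Platform$OS.type == "windows") 1L else detectCores()

M   <- matrix(rnorm(2e4), nrow = 20)   # smaller matrix for illustration
res <- mclapply(seq_len(ncol(M)),
                function(i) t.test(M[, i]),
                mc.cores = n.cores)

length(res)         # one "htest" result object per column
res[[1]]$statistic  # the t-statistic of the first column
```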

EDIT

If you only care about the t-statistic itself, extracting the corresponding field ($statistic) makes things a little faster, in particular in the multi-core case:

    > system.time(invisible(apply(M, 2, function(c) t.test(c)$statistic)))
       user  system elapsed
     80.920   0.437  82.109
    > system.time(invisible(mclapply(1:dim(M)[2], function(i) t.test(M[,i])$statistic)))
       user  system elapsed
     21.246   1.367  24.107

Or, even faster, compute the t-value directly:

    my.t.test <- function(c){
      n <- sqrt(length(c))
      mean(c)*n/sd(c)
    }

Then

    > system.time(invisible(apply(M, 2, function(c) my.t.test(c))))
       user  system elapsed
     21.371   0.247  21.532
    > system.time(invisible(mclapply(1:dim(M)[2], function(i) my.t.test(M[,i]))))
       user  system elapsed
    144.161   8.658   6.313
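The timings above still call my.t.test() once per column. Going one step further (my own addition, not part of the original answer), the whole computation can be vectorized in base R with colMeans() and colSums(), which removes the per-column function-call overhead entirely:

```r
# Fully vectorized one-sample t-statistics for every column of M.
# Uses the identity sum((x - m)^2) = sum(x^2) - n*m^2 to obtain all
# column variances without any per-column loop.
vec.t.test <- function(M) {
  n <- nrow(M)
  m <- colMeans(M)
  v <- (colSums(M^2) - n * m^2) / (n - 1)  # column-wise sample variances
  m * sqrt(n) / sqrt(v)
}

M  <- matrix(rnorm(2e3), nrow = 20)
t1 <- apply(M, 2, function(c) mean(c) * sqrt(length(c)) / sd(c))
t2 <- vec.t.test(M)
all.equal(t1, t2)  # TRUE (up to floating-point tolerance)
```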

You can do better than this by using the colttests function from the genefilter package (on Bioconductor).

    > library(genefilter)
    > M <- matrix(rnorm(40), nrow=20)
    > my.t.test <- function(c){
    +   n <- sqrt(length(c))
    +   mean(c)*n/sd(c)
    + }
    > x1 <- apply(M, 2, function(c) my.t.test(c))
    > x2 <- colttests(M, gl(1, nrow(M)))[,"statistic"]
    > all.equal(x1, x2)
    [1] TRUE
    > M <- matrix(rnorm(1e7), nrow=20)
    > system.time(invisible(apply(M, 2, function(c) my.t.test(c))))
       user  system elapsed
     27.386   0.004  27.445
    > system.time(invisible(colttests(M, gl(1, nrow(M)))[,"statistic"]))
       user  system elapsed
      0.412   0.000   0.414

Reference: "Computing thousands of test statistics simultaneously in R", Statistical Computing & Graphics Newsletter (SCGN), Vol. 18(1), 2007, http://stat-computing.org/newsletter/issues/scgn-18-1.pdf


Source: https://habr.com/ru/post/920371/

