The following function extracts p-values from a fit on a 13x20,000,000 matrix (like yours) in about 25 seconds.
pvalOnly2 <- function(fit) {
    # residual degrees of freedom and unscaled coefficient covariances,
    # computed from the QR decomposition just as summary.lm does internally
    rdf <- fit$df.residual
    p1 <- 1L:fit$rank
    R <- chol2inv(fit$qr$qr[p1, p1, drop = FALSE])

    # per-response residual variance, hence a standard error for
    # every coefficient of every model
    resvar <- colSums(fit$residuals^2) / rdf
    se <- sqrt(outer(diag(R), resvar))

    # two-sided t-test p-values
    2 * pt(abs(fit$coefficients / se), rdf, lower.tail = FALSE)
}
This computes the same p-values as the call to the summary function (or Benjamin's pvalOnly function). However, it skips all the other steps that summary performs for each model, which makes it much faster. (Note that Benjamin's pvalOnly calls vcov, which in turn calls summary, so it does not save time.)
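To convince yourself that the shortcut really reproduces summary's coefficient p-values, here is a quick sanity check on small made-up data (the pvalOnly2 definition is included again so the snippet runs on its own):

```r
# pvalOnly2, included here so this snippet is self-contained
pvalOnly2 <- function(fit) {
    rdf <- fit$df.residual
    p1 <- 1L:fit$rank
    R <- chol2inv(fit$qr$qr[p1, p1, drop = FALSE])
    resvar <- colSums(fit$residuals^2) / rdf
    se <- sqrt(outer(diag(R), resvar))
    2 * pt(abs(fit$coefficients / se), rdf, lower.tail = FALSE)
}

# small made-up data, same shape as the example below
set.seed(42)
mat <- matrix(rnorm(13 * 5), nrow = 13)
groups <- rep(c("wildtype", "mutant"), times = c(10, 3))
fit <- lm(mat ~ groups)

fast <- pvalOnly2(fit)
slow <- do.call("cbind", lapply(summary(fit), function(f) coef(f)[, 4]))
all.equal(unname(fast), unname(slow))  # should be TRUE
```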
On a small matrix, this is about 30 times faster than summary:
m <- c(28, 28, 28, 29, 33, 39, 49, 58, 63, 64,
       30, 27, 24, 20, 17, 19, 33, 49, 56, 57,
       36, 32, 28, 23, 20, 27, 48, 77, 96, 103,
       27, 26, 26, 23, 21, 23, 33, 46, 53, 52,
       24, 20, 17, 13, 11, 14, 33, 47, 40, 32,
       40, 46, 49, 48, 44, 49, 57, 59, 61, 53,
       22, 24, 26, 32, 38, 39, 44, 53, 59, 58,
       16, 16, 14, 10, 7, 14, 34, 55, 62, 61,
       28, 25, 21, 19, 22, 32, 45, 58, 64, 61,
       28, 26, 21, 16, 14, 19, 33, 50, 59, 59,
       17, 16, 15, 14, 17, 25, 38, 54, 61, 58,
       11, 11, 12, 13, 16, 23, 34, 46, 51, 45,
       22, 21, 20, 19, 16, 18, 32, 51, 50, 38)
mat <- matrix(m, nrow = 13)
groups <- rep(c("wildtype", "mutant"), times = c(10, 3))
fit <- lm(mat ~ groups)

library(microbenchmark)
microbenchmark(summary = do.call("cbind", lapply(summary(fit), function(f) coef(f)[, 4])),
               pvalOnly2(fit))
with the results:
Unit: microseconds
           expr      min       lq     mean    median        uq      max neval cld
        summary 3383.085 3702.238 3978.110 3919.0755 4147.4015 5475.223   100   b
 pvalOnly2(fit)   81.538   91.541  136.903  137.1275  157.5535  459.415   100  a
The speed advantage grows, however, with the number of models you fit. On a 13x1000 matrix it is about 300 times faster. And on my machine, with 20 million columns, it computes the p-values in 25 seconds, twice as fast as the fit <- lm(mat ~ groups) step itself:
> mat <- mat[, rep(1:10, 2e6)]   # just replicating the same coefs
> dim(mat)
[1]       13 20000000
> system.time(fit <- lm(mat ~ groups))
   user  system elapsed
 37.272  10.296  58.372
> system.time(pvals <- pvalOnly2(fit))
   user  system elapsed
 21.945   1.889  24.322
The resulting p-values are correct (the same as what summary would give you):
> dim(pvals)
[1]        2 20000000
> pvals[, 1:10]
                       [,1]       [,2]       [,3]         [,4]        [,5]       [,6]
(Intercept)     0.006048267 0.01234835 0.02655251 0.0004555316 0.001004109 0.01608319
groupswildtype  0.129224604 0.22806894 0.88113522 0.2064583345 0.103624361 0.84642990
                        [,7]      [,8]       [,9]        [,10]
(Intercept)     0.0004630405 0.1386393 0.05107805 5.042796e-05
groupswildtype  0.2717139022 0.1539826 0.66351492 5.942893e-02
(By the way, profiling shows that almost all of the time in the function is spent in the pt function; since that is implemented in C, it is about as fast as it could be in any language.)
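If you want to check that yourself, Rprof makes it straightforward. Here is a minimal sketch that profiles a large pt() call of the kind the function makes (the vector length is arbitrary, chosen just so the call runs long enough to collect profiling samples):

```r
# precompute the input so rnorm() itself is not profiled
x <- abs(rnorm(5e6))

# profile a pt() call comparable to the one inside pvalOnly2
Rprof(prof <- tempfile())
p <- 2 * pt(x, df = 11, lower.tail = FALSE)
Rprof(NULL)

summaryRprof(prof)$by.self   # pt should account for nearly all the time
```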
In response to your comment: you can also get a p-value for each model (from the F-statistic) with the following function, which is comparable in speed to pvalOnly2:
modelPvalOnly <- function(fit) {
    f <- t(fit$fitted.values)
    if (attr(fit$terms, "intercept")) {
        mss <- rowSums((f - rowMeans(f)) ^ 2)
        numdf <- fit$rank - 1
    } else {
        mss <- rowSums(f ^ 2)
        numdf <- fit$rank
    }
    resvar <- colSums(fit$residuals ^ 2) / fit$df.residual
    fstat <- (mss / numdf) / resvar
    pval <- pf(fstat, numdf, fit$df.residual, lower.tail = FALSE)
    pval
}
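As a quick check on small made-up data, modelPvalOnly should match the overall F-test p-value that summary() reports for each response (the function definition is included again so the snippet runs on its own):

```r
# modelPvalOnly, included here so this snippet is self-contained
modelPvalOnly <- function(fit) {
    f <- t(fit$fitted.values)
    if (attr(fit$terms, "intercept")) {
        mss <- rowSums((f - rowMeans(f)) ^ 2)
        numdf <- fit$rank - 1
    } else {
        mss <- rowSums(f ^ 2)
        numdf <- fit$rank
    }
    resvar <- colSums(fit$residuals ^ 2) / fit$df.residual
    fstat <- (mss / numdf) / resvar
    pf(fstat, numdf, fit$df.residual, lower.tail = FALSE)
}

set.seed(1)
mat <- matrix(rnorm(13 * 5), nrow = 13)
groups <- rep(c("wildtype", "mutant"), times = c(10, 3))
fit <- lm(mat ~ groups)

pv_fast <- modelPvalOnly(fit)

# the slow route: pull the F-statistic out of each summary
pv_slow <- sapply(summary(fit), function(s) {
    fs <- s$fstatistic
    unname(pf(fs[1], fs[2], fs[3], lower.tail = FALSE))
})

all.equal(unname(pv_fast), unname(pv_slow))  # should be TRUE
```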