Quickly extract pvalues โ€‹โ€‹from multiple lm () in R

I have a matrix (mat) with dims "13, 20,000,000" and the following groups

[1,] "wildtype" [2,] "wildtype" [3,] "wildtype" [4,] "wildtype" [5,] "wildtype" [6,] "wildtype" [7,] "wildtype" [8,] "wildtype" [9,] "wildtype" [10,] "wildtype" [11,] "mutant" [12,] "mutant" [13,] "mutant" 

With the following R code, I run lm() 20M times for each data point.

lm(mat ~ groups) really fast. What takes a lot of time is extracting a pvalue for each model using summary(lm1) .

How can I speed up the extraction of pvalues?

 tvals_out <-'/tmp/tvals_lm.csv' infile <- '/tmp/tempdata.dat' con <- file(infile, "rb") dim <- readBin(con, "integer", 2) mat <- matrix( readBin(con, "numeric", prod(dim)), dim[1], dim[2]) close(con) groups = factor(c(rep('wt', 10), rep('mut', 3))) lm1 <- lm(mat ~ groups) # This is the longest running bit sum_lm1 <- summary(lm1) num_pixels <- dim(mat)[2] result_pvalues <- numeric(num_pixels) result_pvalues <- vapply(sum_lm1, function(x) x$coefficients[,4][2], FUN.VALUE = 1) write.table(result_pvalues, tvals_out, sep=','); outCon <- file(tvals_out, "wb") writeBin(result_pvalues, outCon) close(outCon) 

edit:

I added an approximate data bit of 10 data points from the mat object

 m <- c(28, 28, 28, 29, 33, 39, 49, 58, 63,64,30, 27, 24, 20, 17, 19, 33, 49, 56,57,36, 32, 28, 23, 20, 27, 48, 77, 96, 103,27, 26, 26, 23, 21, 23, 33, 46, 53,52,24, 20, 17, 13, 11, 14, 33, 47, 40,32,40, 46, 49, 48, 44, 49, 57, 59, 61,53,22, 24, 26, 32, 38, 39, 44, 53, 59,58,16, 16, 14, 10,7, 14, 34, 55, 62,61,28, 25, 21, 19, 22, 32, 45, 58, 64,61,28, 26, 21, 16, 14, 19, 33, 50, 59,59,17, 16, 15, 14, 17, 25, 38, 54, 61,58,11, 11, 12, 13, 16, 23, 34, 46, 51,45,22, 21, 20, 19, 16, 18, 32, 51, 50,38) mat <- matrix(m, nrow=13) 
+5
source share
4 answers

The following function is able to extract p-values โ€‹โ€‹from a fit on a 13x20,000,000 matrix (for example, yours) in about 25 seconds.

 pvalOnly2 <- function(fit) { # get estimates est <- fit$coefficients[fit$qr$pivot, ] # get R: see stats:::summary.lm to see how this is calculated p1 <- 1L:(fit$rank) R <- diag(chol2inv(fit$qr$qr[p1, p1, drop = FALSE])) # get residual sum of squares for each resvar <- colSums(fit$residuals^2) / fit$df.residual # R is same for each coefficient, resvar is same within each model se <- sqrt(outer(R, resvar)) pt(abs(est / se), df = fit$df.residual, lower.tail = FALSE) * 2 } 

This computes the same p values โ€‹โ€‹as the call to the summary function (or the Benjamin pvalOnly function). However, it skips all the other steps that summary performs for each model, making it much faster. (Note that Benjamin pvalOnly calls vcov , which in turn calls summary , so it does not save time).

On a small matrix, this is about 30 times faster than a resume:

 m <- c(28, 28, 28, 29, 33, 39, 49, 58, 63,64,30, 27, 24, 20, 17, 19, 33, 49, 56,57,36, 32, 28, 23, 20, 27, 48, 77, 96, 103,27, 26, 26, 23, 21, 23, 33, 46, 53,52,24, 20, 17, 13, 11, 14, 33, 47, 40,32,40, 46, 49, 48, 44, 49, 57, 59, 61,53,22, 24, 26, 32, 38, 39, 44, 53, 59,58,16, 16, 14, 10,7, 14, 34, 55, 62,61,28, 25, 21, 19, 22, 32, 45, 58, 64,61,28, 26, 21, 16, 14, 19, 33, 50, 59,59,17, 16, 15, 14, 17, 25, 38, 54, 61,58,11, 11, 12, 13, 16, 23, 34, 46, 51,45,22, 21, 20, 19, 16, 18, 32, 51, 50,38) mat <- matrix(m, nrow=13) groups <- rep(c("wildtype", "mutant"), times = c(10, 3)) fit <- lm(mat ~ groups) library(microbenchmark) microbenchmark(summary = do.call("cbind", lapply(summary(fit), function(f) coef(f)[, 4])), pvalOnly2(fit)) 

with the results:

 Unit: microseconds expr min lq mean median uq max neval cld summary 3383.085 3702.238 3978.110 3919.0755 4147.4015 5475.223 100 b pvalOnly2(fit) 81.538 91.541 136.903 137.1275 157.5535 459.415 100 a 

The speed advantage is much greater, however, when there are more models that you fit. On a 13x1000 matrix, it has an advantage of 300 times. And on my machine, when there are 20 million columns, it calculates p values โ€‹โ€‹in 25 seconds - twice as fast as the fit <- lm(mat ~ groups) step. Actually.

 > mat <- mat[, rep(1:10, 2e6)] # just replicating same coefs > dim(mat) [1] 13 20000000 > system.time(fit <- lm(mat ~ groups)) user system elapsed 37.272 10.296 58.372 > system.time(pvals <- pvalOnly2(fit)) user system elapsed 21.945 1.889 24.322 

The resulting p-values โ€‹โ€‹are correct (just as you left the resume):

 > dim(pvals) [1] 2 20000000 > pvals[, 1:10] [,1] [,2] [,3] [,4] [,5] [,6] (Intercept) 0.006048267 0.01234835 0.02655251 0.0004555316 0.001004109 0.01608319 groupswildtype 0.129224604 0.22806894 0.88113522 0.2064583345 0.103624361 0.84642990 [,7] [,8] [,9] [,10] (Intercept) 0.0004630405 0.1386393 0.05107805 5.042796e-05 groupswildtype 0.2717139022 0.1539826 0.66351492 5.942893e-02 

(By the way, profiling shows that almost all the time spent in the function is spent in the pt function, since this is done in C, it is about as fast as it can be done in any language).

In response to your comment, you can also get a p-value for each model (from F-statistics) with the following function, which is equal to pvalOnly2 speed:

 modelPvalOnly <- function(fit) { f <- t(fit$fitted.values) if (attr(fit$terms, "intercept")) { mss <- rowSums((f - rowMeans(f)) ^ 2) numdf <- fit$rank - 1 } else { mss <- rowSums(f ^ 2) numdf <- fit$rank } resvar <- colSums(fit$residuals^2) / fit$df.residual fstat <- (mss / numdf) / resvar pval <- pf(fstat, numdf, fit$df.residual, lower.tail = FALSE) pval } 
+3
source

How about trying broom to try?

 install.packages(broom) library(broom) tidy(lm(mat ~ groups)) # response term estimate std.error statistic p.value # 1 Y1 (Intercept) 27.000000 7.967548 3.3887465 6.048267e-03 # 2 Y1 groupswt 14.900000 9.084402 1.6401740 1.292246e-01 # 3 Y2 (Intercept) 23.333333 7.809797 2.9877004 1.234835e-02 # 4 Y2 groupswt 11.366667 8.904539 1.2765026 2.280689e-01 # 5 Y3 (Intercept) 44.000000 17.192317 2.5592828 2.655251e-02 # ...and more... 

Then, to extract only the results for groupswt (note: various ways to accomplish this ...):

 subset(tidy(lm(mat ~ groups)), term == "groupswt")[, c(1,6)] # response p.value # 2 Y1 0.12922460 # 4 Y2 0.22806894 # 6 Y3 0.88113522 # 8 Y4 0.20645833 # 10 Y5 0.10362436 # 12 Y6 0.84642990 # 14 Y7 0.27171390 # 16 Y8 0.15398258 # 18 Y9 0.66351492 # 20 Y10 0.05942893 
+3
source

Itโ€™s hard for me to imagine what would be faster than summary . In the interest of the attempt, I wrote a quick diddy to calculate the p-value from the coefficients and standard errors. I also tried the broom approach. Results based on sample data below

 m <- c(28, 28, 28, 29, 33, 39, 49, 58, 63,64,30, 27, 24, 20, 17, 19, 33, 49, 56,57,36, 32, 28, 23, 20, 27, 48, 77, 96, 103,27, 26, 26, 23, 21, 23, 33, 46, 53,52,24, 20, 17, 13, 11, 14, 33, 47, 40,32,40, 46, 49, 48, 44, 49, 57, 59, 61,53,22, 24, 26, 32, 38, 39, 44, 53, 59,58,16, 16, 14, 10,7, 14, 34, 55, 62,61,28, 25, 21, 19, 22, 32, 45, 58, 64,61,28, 26, 21, 16, 14, 19, 33, 50, 59,59,17, 16, 15, 14, 17, 25, 38, 54, 61,58,11, 11, 12, 13, 16, 23, 34, 46, 51,45,22, 21, 20, 19, 16, 18, 32, 51, 50,38) mat <- matrix(m, nrow=13) groups <- rep(c("wildtype", "mutant"), times = c(10, 3)) fit <- lm(mat ~ groups) #* Using summary do.call("cbind", lapply(summary(fit), function(f) coef(f)[, 4])) #* Directly calculating p-value pvalOnly <- function(fit){ pt(abs(coef(fit) / sqrt(diag(vcov(fit)))), df = fit$df.residual, lower.tail = FALSE) * 2 } pvalDirect <- pvalOnly(fit) #* Using broom library(broom) tidy(fit)$p.value library(microbenchmark) microbenchmark( summary = do.call("cbind", lapply(summary(fit), function(f) coef(f)[, 4])), direct = pvalOnly(fit), broom = tidy(fit)$p.value ) 

As you can see, in this very small representation, using summary is still a little faster than direct calculation. broom adds a lot of time (not surprising, since it works hard to remove what you are not trying to capture)

 Unit: milliseconds expr min lq mean median uq max neval cld summary 1.685857 1.744652 1.969350 1.804914 1.877931 4.929129 100 a direct 1.860630 1.933501 2.184573 2.047279 2.160765 6.442852 100 a broom 5.303015 5.557257 6.060014 5.818830 5.999028 9.879372 100 b 
+3
source

I have a script where I do a bunch of regressions and then collect coefficients, including p-values. That's what it looks like

 library(data.table) summ<-summary(lm1)$coefficients coeffs<-data.table(summ) coeffs[,coef:=row.names(summ)] setnames(coeffs,c("estimate", "stderr","t","p","coef")) 
+1
source

Source: https://habr.com/ru/post/1235705/


All Articles