How to speed up GLM estimation?

I am using RStudio 0.97.320 (R 2.15.3) on Amazon EC2. My data frame has 200k rows and 12 columns.

I am trying to fit a logistic regression with approximately 1,500 parameters.

R uses only 7% of the CPU, the machine has 60+ GB of memory, and the fit still takes a very long time.

Here is the code:

glm.1.2 <- glm(
  formula = Y ~ factor(X1) * log(X2) *
    (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) +
       ((X6 + I(X6^2)) * factor(X7))),
  family = binomial(logit),
  data = df[1:150000, ]
)

Any suggestions for speeding this up significantly?

+7

3 answers

You can try the speedglm function from the speedglm package. I have not used it on problems as big as the one you describe, but it should be easy to use and give you a good speedup, especially if you also install an optimized BLAS library (as @Ben Bolker explained in the comments).

I remember seeing a comparison table of glm and speedglm with and without an optimized BLAS, but I cannot find it today. I do remember it convincing me that I wanted both the BLAS and speedglm.
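As a rough, untested sketch, the call would mirror your glm call, since speedglm uses the same formula/family interface:

library(speedglm)  # install.packages("speedglm") first if needed

# Same model specification as in the question, just swapping glm for speedglm
glm.1.2 <- speedglm(
  formula = Y ~ factor(X1) * log(X2) *
    (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) +
       ((X6 + I(X6^2)) * factor(X7))),
  family = binomial(logit),
  data = df[1:150000, ]
)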

+7

Although a bit late, I can only second @dickoa's suggestion: generate a sparse model matrix using the Matrix package and then feed it to speedglm.wfit. That works great ;-) This way I was able to run a logistic regression on a 1e6 x 3500 model matrix in less than 3 minutes.
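A minimal sketch of that workflow, reusing the formula and variable names from the question (the actual speedup will depend on how sparse the dummy-coded design really is):

library(Matrix)
library(speedglm)

train <- df[1:150000, ]
f <- Y ~ factor(X1) * log(X2) *
  (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) +
     ((X6 + I(X6^2)) * factor(X7)))

# Build the design matrix in sparse form: the factor dummies and their
# interactions are mostly zeros
X <- sparse.model.matrix(f, data = train)

# speedglm.wfit takes the response and design matrix directly;
# sparse = TRUE keeps the linear algebra in sparse form
fit <- speedglm.wfit(y = train$Y, X = X, family = binomial(logit), sparse = TRUE)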

+4

Assuming your design matrix is not sparse, you can also consider my parglm package. See this vignette for a comparison of computation times and further details. I also show a comparison of computation times on a related question.

One of the methods in the parglm function works like the bam function in mgcv. The method is described in detail in:

Wood, S. N., Goude, Y. & Shaw, S. (2015). Generalized additive models for large data sets. Journal of the Royal Statistical Society, Series C, 64(1): 139-155.

The advantage of the method is that it can be implemented with a non-parallel QR implementation and still perform the computation in parallel. Another advantage is a potentially lower memory footprint. This is used by mgcv's bam function and could also be implemented here with a setup similar to speedglm's shglm function.
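As a rough sketch, a parglm call for the model in the question could look like the following (the thread count is just an example; set it to the number of cores on your EC2 instance):

library(parglm)

glm.1.2 <- parglm(
  formula = Y ~ factor(X1) * log(X2) *
    (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) +
       ((X6 + I(X6^2)) * factor(X7))),
  family = binomial(logit),
  data = df[1:150000, ],
  control = parglm.control(nthreads = 4L)  # e.g. 4 threads; adjust to your machine
)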

+1
