How to use biglm with over 2^31 observations

I work with a large dataset that contains more than 2^31 observations; the actual count is close to 3.5 billion.

I use the biglm R package to run a regression with about 70 predictors. I read a million rows at a time and update the regression results with each chunk. The data is saved in ffdf format (via the ff and ffbase packages) so it loads quickly and does not have to fit entirely in RAM.

Here is the main outline of the code I'm using:

    library(ff)
    library(ffbase)
    library(biglm)

    load.ffdf(dir = 'home')
    dim(data)  # the ffdf contains about 70 predictors and 3.5 billion rows

    chunk_1 <- data[1:1000000, ]
    rest_of_data <- data[1000001:nrow(data), ]

    # Running biglm for the first chunk
    b <- biglm(y ~ x1 + x2 + ... + x70, chunk_1)
    chunks <- ceiling(nrow(rest_of_data) / 1000000)

    # Updating the biglm results by iterating through the rest of the data in chunks
    for (i in seq(1, chunks)) {
      start <- 1 + (i - 1) * 1000000
      end <- min(i * 1000000, nrow(rest_of_data))
      d_chunk <- rest_of_data[start:end, ]
      b <- update(b, d_chunk)
    }

The results look great and everything runs smoothly until the total number of observations accumulated across updates exceeds 2^31. Then I get an error:

    In object$n + NROW(mm) : NAs produced by integer overflow

How do I get around this overflow problem? Thanks in advance for your help!

1 answer

I believe I found the source of the problem in the biglm code.

The number of observations (n) is stored as an R integer, which has a maximum value of 2^31 - 1.

The numeric (double) type is not subject to this limit and, as far as I can tell, can be used instead of an integer to store n.
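You can see the limit directly at the R console; this snippet is just a generic illustration of R's integer arithmetic, not biglm-specific code:

    # R integers are 32-bit signed, so the largest representable value is 2^31 - 1
    .Machine$integer.max                   # 2147483647
    .Machine$integer.max + 1L              # NA, warning: NAs produced by integer overflow

    # The same arithmetic on numeric (double) values does not hit this limit
    as.numeric(.Machine$integer.max) + 1   # 2147483648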

Here is a commit on GitHub that shows how to fix this problem with one extra line of code that converts the integer n to numeric. As the model is updated, the number of rows in each new chunk is added to the old n, so n stays numeric from then on.
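To give an idea of the change (the warning above points at the expression object$n + NROW(mm) inside biglm's update step), the fix amounts to something like the following sketch; it is not the literal upstream code:

    # Sketch of the fix in biglm's update step, based on the warning message above.
    # Coercing n to numeric once keeps every subsequent addition numeric,
    # so the running observation count can pass 2^31 - 1 without overflowing.
    object$n <- as.numeric(object$n)   # the one extra line
    object$n <- object$n + NROW(mm)    # existing accumulation of the chunk's row count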

I was able to reproduce the error described in this question and confirm that my fix works using the following code:

(WARNING: this consumes a lot of memory; consider doing more iterations with a smaller data frame if memory is limited.)

    library(biglm)

    # 10 million rows x 3 columns of random normals
    df <- as.data.frame(replicate(3, rnorm(10000000)))
    a <- biglm(V1 ~ V2 + V3, df)

    # 300 updates of 10 million rows each brings the total past 2^31
    # observations (3.01e9 including the initial fit)
    for (i in 1:300) {
      a <- update(a, df)
    }
    print(summary(a))

With the unpatched biglm package, this code outputs:

    Large data regression model: biglm(ff, df)
    Sample size =  NA
                  Coef (95% CI) SE  p
    (Intercept) -1e-04  NA  NA  NA NA
    V2          -1e-04  NA  NA  NA NA
    V3          -2e-04  NA  NA  NA NA

My patched version outputs:

    Large data regression model: biglm(V1 ~ V2 + V3, df)
    Sample size =  3.01e+09
                   Coef   (95%    CI) SE p
    (Intercept) -3e-04 -3e-04 -3e-04  0  0
    V2          -2e-04 -2e-04 -1e-04  0  0
    V3           3e-04  3e-04  3e-04  0  0

The SE and p values are nonzero; they are merely rounded to zero in the output above.

I am new to the R ecosystem, so I would appreciate it if someone could tell me how to submit this patch so that it can be reviewed by the original author and eventually included in the upstream package.

