How to use biglm with over 2^31 observations

I work with a large dataset that contains more than 2^31 observations; the actual count is close to 3.5 billion.

I use the biglm R package to run a regression with about 70 predictors. I read a million rows at a time and update the regression results with each chunk. The data is saved in ffdf format (via the ff and ffbase packages) so it loads quickly and does not have to fit entirely in RAM.

Here is the main outline of the code I'm using:

    library(ff)
    library(ffbase)
    library(biglm)

    load.ffdf(dir = 'home')
    dim(data)  # the ffdf contains about 70 predictors and 3.5 billion rows

    chunk_1 <- data[1:1000000, ]
    rest_of_data <- data[1000001:nrow(data), ]

    # Running biglm for the first chunk
    b <- biglm(y ~ x1 + x2 + ... + x70, chunk_1)
    chunks <- ceiling(nrow(rest_of_data) / 1000000)

    # Updating the biglm results by iterating through the rest of the data in chunks
    for (i in seq(1, chunks)) {
      start <- 1 + (i - 1) * 1000000
      end <- min(i * 1000000, nrow(rest_of_data))
      d_chunk <- rest_of_data[start:end, ]
      b <- update(b, d_chunk)
    }

The results look great and everything runs smoothly until the total number of observations accumulated across updates exceeds 2^31. Then I get an error:

    In object$n + NROW(mm) : NAs produced by integer overflow

How do I get around this overflow problem? Thanks in advance for your help!

1 answer

I believe I found the source of the problem in the biglm code.

The number of observations (n) is stored as an R integer, which has a maximum value of 2^31 - 1.

The numeric (double) type is not subject to this limit and, as far as I can tell, can be used instead of an integer to store n.
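You can see the limit directly at the R console; this snippet is just a generic illustration of R's integer arithmetic, not biglm-specific code:

    # R integers are 32-bit signed, so the largest representable value is 2^31 - 1
    .Machine$integer.max                   # 2147483647
    .Machine$integer.max + 1L              # NA, warning: NAs produced by integer overflow

    # The same arithmetic on numeric (double) values does not hit this limit
    as.numeric(.Machine$integer.max) + 1   # 2147483648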

Here is a commit on GitHub that shows how to fix this problem with one extra line of code that converts the integer n to numeric. As the model is updated, the number of rows in each new chunk is added to the old n, so n stays numeric from then on.
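To give an idea of the change (the warning above points at the expression object$n + NROW(mm) inside biglm's update step), the fix amounts to something like the following sketch; it is not the literal upstream code:

    # Sketch of the fix in biglm's update step, based on the warning message above.
    # Coercing n to numeric once keeps every subsequent addition numeric,
    # so the running observation count can pass 2^31 - 1 without overflowing.
    object$n <- as.numeric(object$n)   # the one extra line
    object$n <- object$n + NROW(mm)    # existing accumulation of the chunk's row count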

I was able to reproduce the error described in this question and confirm that my fix works using the following code:

(WARNING: this consumes a lot of memory; consider doing more iterations with a smaller data frame if memory is limited.)

    library(biglm)

    # 10 million rows x 3 columns of random normals
    df <- as.data.frame(replicate(3, rnorm(10000000)))
    a <- biglm(V1 ~ V2 + V3, df)

    # 300 updates of 10 million rows each brings the total past 2^31
    # observations (3.01e9 including the initial fit)
    for (i in 1:300) {
      a <- update(a, df)
    }
    print(summary(a))

With the unpatched biglm package, this code outputs:

    Large data regression model: biglm(ff, df)
    Sample size =  NA
                  Coef (95% CI) SE  p
    (Intercept) -1e-04  NA  NA  NA NA
    V2          -1e-04  NA  NA  NA NA
    V3          -2e-04  NA  NA  NA NA

My patched version outputs:

    Large data regression model: biglm(V1 ~ V2 + V3, df)
    Sample size =  3.01e+09
                   Coef   (95%    CI) SE p
    (Intercept) -3e-04 -3e-04 -3e-04  0  0
    V2          -2e-04 -2e-04 -1e-04  0  0
    V3           3e-04  3e-04  3e-04  0  0

The SE and p values are nonzero; they are merely rounded to zero in the output above.

I am new to the R ecosystem, so I would appreciate it if someone could tell me how to submit this patch so that it can be reviewed by the original author and eventually included in the upstream package.

