Regression in R on 4 million rows

I have a text file with rows of the form (UserId, MovieId, Ratings, Time), and I want to run a regression on this data set (4 variables, more than 4 million rows).

model <- glm(UserId ~ MovieId + Ratings + Time, data = <name>)

This gives an error:

 Error: cannot allocate vector of size 138.5 Mb

The file is only 93 MB. How can I run a regression in R without running into memory problems? Should I store the data differently?

Thanks.

Additional information: I am working on a Linux machine with 3 GB of RAM. I have googled around, but most of the links I found talk about datasets that are larger than RAM, which is not my case :( (the file is only 93 MB in total).

+6
3 answers

biglm is a package specifically designed to fit regression models to large datasets.

It works by processing data block by block. The amount of memory required depends on the number of variables, but does not depend on the number of observations.
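For example, here is a minimal sketch of what chunked fitting with biglm could look like. It assumes the data sits in a whitespace-delimited file called ratings.txt with numeric columns UserId, MovieId, Ratings and Time; the file name, chunk size and column layout are assumptions, not part of the original answer:

 # Hedged sketch: fit the model chunk by chunk with biglm
 library(biglm)

 cols <- c("UserId", "MovieId", "Ratings", "Time")
 con  <- file("ratings.txt", open = "r")     # adjust sep= in read.table if comma-separated
 chunk_size <- 100000

 # The first chunk initialises the model
 chunk <- read.table(con, nrows = chunk_size, col.names = cols)
 fit   <- biglm(UserId ~ MovieId + Ratings + Time, data = chunk)

 # Each remaining chunk only updates the running sufficient statistics,
 # so memory use stays roughly constant regardless of the number of rows
 repeat {
   chunk <- tryCatch(read.table(con, nrows = chunk_size, col.names = cols),
                     error = function(e) NULL)   # read.table errors at end of file
   if (is.null(chunk) || nrow(chunk) == 0) break
   fit <- update(fit, chunk)
 }
 close(con)

 summary(fit)

One caveat: if MovieId were treated as a factor, every chunk would have to produce the same model-matrix columns, so the factor levels would need to be fixed up front. Reading the IDs as plain numbers, as the original glm call does, sidesteps that.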

+8

The model matrix glm builds has the same number of rows as your data, but its number of columns is roughly the number of factor levels (for example, one column per distinct MovieId)!

So if you have 1000 movies, that generates roughly a 4e6 x 1000 matrix of doubles. At 8 bytes per double, that is about 32 GB...

You can try to create the model matrix separately as follows:

 # Sample of 100 rows, 10 users, 20 movies
 d <- data.frame(UserId  = rep(paste('U', 1:10), each = 10),
                 MovieId = sample(paste('M', 1:20), 100, replace = TRUE),
                 Ratings = runif(100),
                 Time    = runif(100, 45, 180))
 dim(d)  # 100 x 4

 m <- model.matrix(~ MovieId + Ratings + Time, data = d)
 dim(m)  # about 100 x 22: intercept + one dummy per MovieId level beyond
         # the first + Ratings + Time (the exact count varies with the sample)
+3

This R error message does not refer to the total memory in use, but to the last block that R tried, and failed, to allocate. You can try profiling memory usage (see the question on monitoring memory usage in R) to see what is actually happening.
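As a rough illustration (the calls below are base-R tools I am adding, not something the answer itself shows), object.size(), gc() and Rprofmem() give a quick picture of where memory goes:

 # Inspect memory use with base R functions
 x <- runif(1e6)                      # about 8 MB of doubles
 print(object.size(x), units = "Mb")  # size of a single object
 gc()                                 # force garbage collection, report memory in use

 # Rprofmem() logs every allocation to a file, but only works if R
 # was compiled with memory profiling enabled
 Rprofmem("memprofile.out")
 y <- rnorm(1e6)
 Rprofmem(NULL)                       # stop logging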

+1

Source: https://habr.com/ru/post/892675/

