Memory-efficient classifiers in R for extremely wide and not too long training sets

The training set is extremely wide (about 200 thousand features) and very short (a few hundred rows). Obviously, the data set takes up a lot of memory, but R reads it without any problems.

Then I tried to train a Random Forest classifier, and it ran out of memory. So I switched to a simpler classifier such as Naive Bayes; NB ran out of memory as well.

Generally, which classifiers are the most memory-efficient? I suspect that logistic regression and Naive Bayes should be on the list ...

UPD:

In the end, I applied feature reduction methods before running the random forest. The caret package can help here, but not with the initial number of variables in my case.

Feature reduction methods used (a rough sketch in R follows the list):

  • variance threshold filter: remove features whose variance is below a threshold;
  • correlation between features and the predicted value: remove features with low correlation;
  • pairwise correlations: remove one feature from each highly correlated pair.
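
A minimal sketch of those three filters, assuming a numeric feature matrix X and a response vector y; the cutoff values and the use of caret::findCorrelation are illustrative choices, and the pairwise-correlation step is only feasible after the first two filters have shrunk the feature set:

    # Sketch of the three filters above; X is a numeric feature matrix, y is the response.
    library(caret)   # for findCorrelation()

    # 1. Variance threshold: drop features whose variance is below a cutoff.
    feat_var <- apply(X, 2, var)
    X <- X[, feat_var > 1e-4, drop = FALSE]

    # 2. Correlation with the target: drop features weakly correlated with y.
    target_cor <- abs(apply(X, 2, cor, y = as.numeric(y)))
    X <- X[, target_cor > 0.05, drop = FALSE]

    # 3. Pairwise correlations: drop one feature from each highly correlated pair.
    high_cor <- findCorrelation(cor(X), cutoff = 0.9)
    if (length(high_cor) > 0) X <- X[, -high_cor, drop = FALSE]
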
3 answers

The most memory-efficient algorithms are those based on online learning (they do not load the entire data set into memory but learn one instance at a time) and on feature hashing, also called the hashing trick (which turns arbitrarily large feature vectors into a predefined/fixed size by hashing). Logistic regression and linear SVM have online-learning and feature-hashing implementations (which come down to optimizing the logistic loss or the hinge loss, respectively).
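
To illustrate the online-learning idea (this is not any particular library's implementation), a bare-bones stochastic-gradient-descent version of logistic regression in R could look like this; X, y, the learning rate and epoch count are assumed/illustrative:

    # Bare-bones online logistic regression: rows are visited one at a time,
    # so only a single instance plus the weight vector lives in memory at once.
    online_logit <- function(X, y, lr = 0.01, epochs = 5) {
      w <- rep(0, ncol(X))                    # X: numeric matrix, y: 0/1 labels
      for (e in seq_len(epochs)) {
        for (i in sample(nrow(X))) {          # shuffle the visiting order
          xi <- X[i, ]
          p  <- 1 / (1 + exp(-sum(w * xi)))   # predicted probability
          w  <- w + lr * (y[i] - p) * xi      # gradient step on the logistic loss
        }
      }
      w
    }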

I don't know about implementations in R (maybe I just don't know the R libraries well enough), but a very strong and widely used learner built on these methods is Vowpal Wabbit. They are also implemented in scikit-learn.
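
For R specifically, the hashing trick itself is simple enough to sketch by hand with the Matrix package; the bucket count and toy hash function below are purely illustrative (CRAN also has dedicated packages for this, e.g. FeatureHashing, not benchmarked here):

    # Minimal hashing-trick sketch: map every feature name to one of
    # n_buckets columns and accumulate values there (collisions just add up).
    library(Matrix)

    hash_features <- function(df, response = "y", n_buckets = 2^12) {
      feats  <- setdiff(names(df), response)
      bucket <- vapply(feats, function(nm) {
        codes <- utf8ToInt(nm)                        # toy string hash
        (sum(codes * seq_along(codes)) %% n_buckets) + 1
      }, numeric(1))
      sparseMatrix(
        i = rep(seq_len(nrow(df)), times = length(feats)),
        j = rep(bucket, each = nrow(df)),
        x = unlist(df[feats], use.names = FALSE),     # duplicate (i, j) pairs are summed
        dims = c(nrow(df), n_buckets)
      )
    }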


Here is a Cornell CS paper that compares the performance of different classifiers. It does not get into speed, but it does cover the predictive power of almost every classification algorithm in wide use today. The fastest will be the algorithms that are not ensemble learners: any algorithm that builds several models and averages the results will inherently take longer. However, as can be seen in Table 2 on page 5, ensemble methods are the most effective classifiers. If you want to build the model as quickly as possible, you should probably stick to a single decision tree or logistic regression. Otherwise, you will need to spend some time getting familiar with ensemble learning techniques and figuring out how to optimize the speed of the particular algorithm. I got good results by parallelizing my random forests using a technique similar to this.
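
For reference, one widely used way to parallelize randomForest in R (a sketch; the worker and tree counts are illustrative, and X/y stand for your predictors and response) is to grow several smaller forests with foreach and merge them with randomForest::combine:

    # Grow 4 x 125 trees on parallel workers, then merge into one forest.
    library(randomForest)
    library(foreach)
    library(doParallel)

    registerDoParallel(cores = 4)
    rf <- foreach(ntree = rep(125, 4),
                  .combine = randomForest::combine,
                  .packages = "randomForest") %dopar% {
      randomForest(X, y, ntree = ntree)
    }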

Edit to address your memory problems more directly: memory usage depends less on which algorithm you choose than on how you use it. Assuming you called randomForest with its defaults for your original model, you would be building 500 decision trees, each sampling ~450 predictor variables and growing as many terminal nodes as there are data points in the sample. That takes a whole lot of memory. The point I am trying to make is that you can tune any of these classification models to reduce the memory footprint and run more efficiently in R. As mentioned earlier, non-ensemble methods (logistic regression, naive Bayes, CHAID/CART/etc.) will use the least memory by default.
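
For example, a few randomForest arguments directly control how big the fitted forest gets; the concrete values below are illustrative, not recommendations:

    # Rein in the size of the fitted forest: fewer, shallower trees grown
    # on subsamples take far less memory than the defaults.
    library(randomForest)

    rf_small <- randomForest(
      x = X, y = y,
      ntree    = 100,                   # fewer trees than the default 500
      nodesize = 10,                    # larger terminal nodes => shallower trees
      maxnodes = 64,                    # hard cap on terminal nodes per tree
      sampsize = floor(0.5 * nrow(X))   # grow each tree on half the rows
    )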


The glmnet package can handle sparse matrices and will use far less memory than an ensemble, while still offering variable selection (via the lasso / elastic net). The code might look like this:

    library(glmnet)

    df <- read.csv()                        # read in the data
    X  <- sparse.model.matrix(~ . - y, df)  # matrix with all variables in df except the y variable
    y  <- df$y
    model <- cv.glmnet(X, y, nfolds = 10, family = "binomial")
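
If you also want to see which variables the lasso kept, the nonzero coefficients at a cross-validated lambda can be pulled out like this ("lambda.min" is one common choice):

    sel <- coef(model, s = "lambda.min")    # sparse column of coefficients
    rownames(sel)[which(sel[, 1] != 0)]     # names of the selected variables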

Source: https://habr.com/ru/post/1483229/
