Increase model training speed in caret (R)

I have a data set consisting of 20 features and approximately 300,000 observations. I use caret to train a model with doParallel and four cores. Even training on 10% of my data takes more than eight hours for the methods I have tried (rf, nnet, adabag, svmPoly). I resample with bootstrap, number = 3, and my tuneLength is 5. Is there anything I can do to speed up this painfully slow process? Someone suggested working with the underlying libraries directly to speed things up by as much as 10x, but before I go down that route I would like to make sure there is no alternative. A rough sketch of my setup is below.
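
Roughly my setup (the data set training and the outcome Class are placeholder names):

    library(caret)
    library(doParallel)

    cl <- makeCluster(4)                        # four workers
    registerDoParallel(cl)

    ctrl <- trainControl(method = "boot", number = 3)

    fit <- train(Class ~ ., data = training,    # ~300,000 rows, 20 features
                 method = "rf",                 # also tried nnet, adabag, svmPoly
                 tuneLength = 5,
                 trControl = ctrl)

    stopCluster(cl)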

3 answers

phiver hits the nail on the head, but there are a few more things to suggest for this situation:

  • make sure you are not exhausting system memory when using parallel processing. You are making X extra copies of the data in memory when using X workers.
  • with a class imbalance, additional sampling can help. Downsampling might improve performance and take less time.
  • use different libraries: ranger instead of randomForest, xgboost or C5.0 instead of gbm. Bear in mind that ensemble methods fit tons of constituent models and are bound to take a while.
  • the package has a racing-type algorithm (adaptive resampling) for tuning parameters in less time
  • the GitHub development version has random search methods for models with many tuning parameters. (A sketch combining several of these points follows this list.)
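
A hedged sketch of how some of these suggestions could be combined; the data set training, the outcome Class, and the exact adaptive settings are assumptions, not part of the answer:

    library(caret)
    library(doParallel)

    cl <- makeCluster(4)                # keep the worker count low to limit data copies
    registerDoParallel(cl)

    ctrl <- trainControl(
      method   = "adaptive_cv",         # racing-type adaptive resampling
      number   = 5,
      adaptive = list(min = 3, alpha = 0.05, method = "gls", complete = TRUE),
      sampling = "down",                # downsample the majority class
      search   = "random"               # random search over tuning parameters
    )

    fit <- train(Class ~ ., data = training,
                 method = "ranger",     # faster than method = "rf"
                 tuneLength = 5,
                 trControl = ctrl)

    stopCluster(cl)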

Max


What people often forget when comparing an underlying model with its use through caret is that caret does a lot of extra work.

Take, for example, your random forest: bootstrap with number = 3 and tuneLength = 5. You resample 3 times, and because of the tuneLength you try 5 candidate values of mtry. In total you run 15 random forests and compare them to pick the best one for the final model, versus only 1 if you use the basic randomForest model.

You are also running in parallel on 4 cores, and randomForest needs all available observations, so all your training observations will be held in memory 4 times. That probably leaves little memory for training the model.

My advice is to start scaling down to see whether you can speed things up: for example, set the bootstrap number to 1 and tuneLength back to the default of 3. Or even set the trainControl method to "none", just to get an idea of how fast the model is at minimal settings with no resampling at all.
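
A minimal sketch of that last suggestion; note that with method = "none" caret expects a single parameter combination in tuneGrid rather than a tuneLength (data set and outcome names are placeholders):

    library(caret)

    ctrl <- trainControl(method = "none")          # no resampling at all

    fit <- train(Class ~ ., data = training,
                 method = "rf",
                 tuneGrid = data.frame(mtry = 5),  # exactly one candidate
                 trControl = ctrl)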


Great inputs from @phiver and @topepo. I will try to summarize and add a few more points that I compiled from a number of SO posts I made about a similar problem:

  • Yes, parallel processing is faster but eats into memory. With 8 cores and 64 GB of RAM, a rule of thumb may be to use 5-6 workers at most.
  • @topepo's presentation on caret preprocessing is fantastic. It is instructive step by step and helps replace manual preprocessing work such as creating dummy variables, removing multicollinear variables / linear combinations of variables, and applying transformations.
  • One reason randomForest and other such models get very slow is the number of levels in categorical variables. It is recommended either to club factor levels together or, where possible, to convert them to ordinal / numeric values.
  • Make full use of caret's tuneGrid for ensemble models. Start with the smallest mtry / ntree values on a sample of the data and see how much accuracy improves as you grow them.
  • I found this SO page very useful; it is where parRF was first suggested. Replacing rf with parRF did not improve things much on my data set, but you can try it. Other suggestions include using data.table instead of data.frame and using the predictor / response interface instead of a formula. This greatly improves speed, believe me (but one caution: the predictor / response interface (providing x = X, y = Y as data.tables) also seems to change predictive accuracy somewhat and alters the variable importance table relative to the formula interface (Y ~ .)). A sketch of the last two points follows this list.
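
A sketch of those last suggestions, using the x / y interface with a data.table, parRF, and a small tuneGrid; the object and column names are assumptions:

    library(caret)
    library(data.table)

    dt <- as.data.table(training)
    x  <- dt[, !"Class"]                     # predictors, without the response column
    y  <- dt[["Class"]]                      # response

    grid <- expand.grid(mtry = c(2, 5, 8))   # start small, grow if accuracy keeps improving

    fit <- train(x = x, y = y,               # x / y interface instead of Class ~ .
                 method = "parRF",           # parallel random forest
                 tuneGrid = grid,
                 ntree = 100,                # fewer trees while experimenting
                 trControl = trainControl(method = "cv", number = 3))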
