Learning curve plot with caret and R

I would like to study the optimal bias/variance tradeoff for model tuning. I use caret for R, which lets me plot a performance metric (AUC, accuracy, ...) against model hyperparameters (mtry, lambda, etc.) and automatically picks the maximum. This usually returns a good model, but if I want to dig further and choose a different bias/variance tradeoff, I need a learning curve, not a performance curve.

For simplicity, suppose my model is a random forest with only one hyperparameter, mtry.
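For reference, a minimal sketch of the kind of tuning call I mean; the data frame df and outcome y here are placeholders, not real data:

    library(caret)

    # placeholder data frame `df` with outcome column `y`
    ctrl <- trainControl(method = "cv", number = 10)
    fit <- train(y ~ ., data = df,
                 method = "rf",
                 tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)),
                 trControl = ctrl)
    fit$bestTune  # the mtry value caret picks automatically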

I would like to build learning curves for both the training and test sets. Something like this:

(image: learning curve)

(the red curve is the test set)

An error metric goes on the y axis (the number of misclassified examples or something like that); on the x axis, either mtry or, alternatively, the size of the training set.

Questions:

  • Is it possible to iteratively train models on training sets of different sizes? If I have to run the code manually, how can I do this?

  • If I want to put the hyperparameter on the x axis, I need all the models trained by caret::train, not just the final model (the one with maximum performance after CV). Are these "discarded" models still available after train?

3 answers
  • caret will iteratively test many CV models for you if you specify trainControl() and a grid of parameter values (e.g. mtry) via the tuneGrid argument, typically built with expand.grid(). Both are then passed to the train() function. The specific tuneGrid parameters (for example, mtry, ntree) differ for each type of model.

  • Yes, the final train fit will contain the error rate (however you specify it) for all the folds of your CV.

That way you can specify, for example, 10-fold CV times a grid of 10 mtry values, which will be 100 iterations. You might want to go make a cup of tea, or maybe have lunch.
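A minimal sketch of that setup; the data frame dat and outcome Class are placeholders, and the grid values are arbitrary:

    library(caret)

    ctrl <- trainControl(method = "cv", number = 10)  # 10-fold CV
    grid <- expand.grid(mtry = 1:10)                  # 10 mtry values

    fit <- train(Class ~ ., data = dat,               # placeholder data
                 method = "rf",
                 tuneGrid = grid,
                 trControl = ctrl)                    # 10 x 10 = 100 model fits

    fit$results  # CV performance for every mtry, not just the winner
    plot(fit)    # metric vs. mtry: the hyperparameter-on-the-x-axis plot

This also covers the second question: fit$results keeps the performance of every candidate value in the grid, not only fit$finalModel.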

If that sounds complicated... there is a very good worked example in the caret documentation - caret is one of the best documented packages.


Here is my code for how I approached this problem of building a learning curve in R, using caret to train the model. I use the Motor Trend Car Road Tests data (mtcars) in R for illustrative purposes. First, I randomize and split mtcars into training and test sets: 21 records for training and the remaining 11 for the test set. The response variable in this example is mpg.

    library(caret)

    # set seed for reproducibility
    set.seed(7)

    # randomize mtcars
    mtcars <- mtcars[sample(nrow(mtcars)), ]

    # split mtcars into training and test sets
    mtcarsIndex <- createDataPartition(mtcars$mpg, p = .625, list = FALSE)
    mtcarsTrain <- mtcars[mtcarsIndex, ]
    mtcarsTest  <- mtcars[-mtcarsIndex, ]

    # create an empty data frame, one row per training set size;
    # rows stay NA until the loop fills them, so the plot skips them
    learnCurve <- data.frame(m = integer(21),
                             trainRMSE = NA_real_,
                             cvRMSE = NA_real_)

    # test data response feature
    testY <- mtcarsTest$mpg

    # run the algorithm using 10-fold cross validation with 3 repeats
    trainControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
    metric <- "RMSE"

    # loop over training set sizes
    for (i in 3:21) {
        learnCurve$m[i] <- i

        # train the learning algorithm on the first i training examples;
        # results$RMSE is the cross-validated RMSE on that subset
        fit.lm <- train(mpg ~ ., data = mtcarsTrain[1:i, ],
                        method = "lm",
                        metric = metric,
                        preProc = c("center", "scale"),
                        trControl = trainControl)
        learnCurve$trainRMSE[i] <- fit.lm$results$RMSE

        # use the trained model to predict on the held-out test data
        prediction <- predict(fit.lm, newdata = mtcarsTest[, -1])
        rmse <- postResample(prediction, testY)
        learnCurve$cvRMSE[i] <- rmse[1]
    }

    pdf("LinearRegressionLearningCurve.pdf", width = 7, height = 7, pointsize = 12)

    # plot learning curves of training set size vs. error measure
    # for the training set and the test set
    plot(log(learnCurve$trainRMSE), type = "o", col = "red",
         xlab = "Training set size", ylab = "Error (log RMSE)",
         main = "Linear Model Learning Curve")
    lines(log(learnCurve$cvRMSE), type = "o", col = "blue")
    legend("topright", c("Train error", "Test error"),
           lty = c(1, 1), lwd = c(2.5, 2.5), col = c("red", "blue"))
    dev.off()

The output chart looks like this: MtCarsLearningCurve.png
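If you want the random forest from the question instead of a linear model, the same loop should work with caret's "rf" method. A sketch of the replacement train() call, with an arbitrary fixed mtry so every training set size is fitted with the same settings:

    # sketch only: swap this into the loop body in place of the lm fit;
    # mtry = 3 is an arbitrary fixed value for mtcars' 10 predictors
    fit.rf <- train(mpg ~ ., data = mtcarsTrain[1:i, ],
                    method = "rf",
                    metric = metric,
                    tuneGrid = expand.grid(mtry = 3),
                    trControl = trainControl)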


At some point, probably after this question was asked, caret added the learing_curve_dat function (the misspelled name was later corrected to learning_curve_dat), which helps to evaluate a model's performance across a range of training set sizes.

Here is an example from the function documentation:

    library(caret)
    library(ggplot2)

    set.seed(1412)
    class_dat <- twoClassSim(1000)

    set.seed(29510)
    # NOTE: learing_curve_dat below is not a typo; that is how the
    # function was originally spelled in caret
    lda_data <- learing_curve_dat(dat = class_dat,
                                  outcome = "Class",
                                  test_prop = 1/4,
                                  ## `train` arguments:
                                  method = "lda",
                                  metric = "ROC",
                                  trControl = trainControl(classProbs = TRUE,
                                                           summaryFunction = twoClassSummary))

    ggplot(lda_data, aes(x = Training_Size, y = ROC, color = Data)) +
        geom_smooth(method = loess, span = .8)

Performance metrics are computed for each Training_Size and stored in lda_data along with the Data variable ("Resampling", "Training", and, optionally, "Testing").
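For example, to inspect just the held-out test curve (column names as documented: Training_Size, the metric, and Data):

    # keep only the rows measured on the held-out test set
    testing_curve <- subset(lda_data, Data == "Testing")
    head(testing_curve)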

Here is the link to the function documentation: https://rdrr.io/cran/caret/man/learing_curve_dat.html

To be clear, this answers the first part of the question (learning curves over training set sizes), but not the second part (putting the hyperparameter on the x axis).



