How do I tune glmnet hyperparameters and evaluate performance with nested cross-validation in mlr?

I am trying to use the R mlr package to train a glmnet model on a binary classification problem with a large data set (about 850,000 rows and about 100 features) on very modest hardware (my laptop with 4 GB of RAM; I do not have access to anything more powerful). I decided to use mlr because I need nested cross-validation to tune the hyperparameters of my classifier and estimate the expected performance of the final model. As far as I know, neither caret nor h2o offers nested cross-validation at present, but mlr provides the infrastructure for it.

However, I find the sheer number of functions provided by mlr overwhelming, and it is hard to figure out how to put everything together to achieve my goal. What goes where? How do the pieces fit together? I have read all the documentation here: https://mlr-org.imtqy.com/mlr-tutorial/release/html/ and I am still confused. There are code snippets that show how to do particular things, but it is not clear (to me) how to stitch them together. What is the big picture? I was hoping to find a complete worked example to use as a template and have only found this: https://www.bioconductor.org/help/course-materials/2015/CSAMA2015/lab/classification.html , which I have been using as my starting point. Can anyone help fill in the blanks?

Here is what I want to do:

Tune the hyperparameters (the l1/l2 regularization parameters alpha and lambda) of a glmnet model using grid search or random search (or something faster if it exists: iterated F-racing? adaptive resampling?) with stratified k-fold cross-validation in the inner loop, and an outer cross-validation loop to estimate the expected final performance. I want to include a feature-preprocessing step in the inner loop, with centering, scaling and a Yeo-Johnson transformation, plus fast filter-based feature selection (the latter is necessary because I have very modest hardware and need to shrink the feature space to cut training time).

I have imbalanced classes (the positive class is about 20%), so I have decided to use AUC as my optimization objective, but this is only a surrogate for the real metric of interest, which is the false positive rate at a small number of fixed true positive rates (i.e. I want to know the FPR at TPR = 0.6, 0.7, 0.8). I would also like to tune the probability thresholds that achieve those TPRs; I gather this is possible within nested CV, but it is not clear to me exactly what is being optimized here: https://github.com/mlr-org/mlr/issues/856 . I want to know where the cut-off should be without information leakage, so I want to choose it via CV.
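For the FPR-at-fixed-TPR part, once out-of-fold predicted probabilities are available the operating points can be read straight off the ROC curve. A minimal base-R sketch (synthetic labels and scores stand in for real nested-CV predictions; `fpr_at_tpr` is a hypothetical helper, not an mlr function):

```r
# Sketch: FPR and threshold at a fixed target TPR, from predicted probabilities.
# Synthetic data stands in for out-of-fold predictions from nested CV.
set.seed(42)
n <- 1000
y <- rbinom(n, 1, 0.2)                               # ~20% positive class
p <- ifelse(y == 1, rbeta(n, 4, 2), rbeta(n, 2, 4))  # informative scores

fpr_at_tpr <- function(labels, probs, target_tpr) {
  # Sweep thresholds from high to low; return the first (highest) threshold
  # whose TPR meets the target, together with the FPR at that threshold.
  thresholds <- sort(unique(probs), decreasing = TRUE)
  pos <- sum(labels == 1)
  neg <- sum(labels == 0)
  for (t in thresholds) {
    tpr <- sum(probs >= t & labels == 1) / pos
    if (tpr >= target_tpr) {
      return(c(threshold = t, tpr = tpr,
               fpr = sum(probs >= t & labels == 0) / neg))
    }
  }
  c(threshold = NA, tpr = NA, fpr = NA)
}

sapply(c(0.6, 0.7, 0.8), function(t) fpr_at_tpr(y, p, t))
```

Applying this to the pooled out-of-fold predictions from the outer loop keeps the threshold choice out of the training data, which is the "no leakage" property asked for above.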

I am using glmnet because I would rather spend my CPU cycles building a solid model than a fancier model that yields over-optimistic results. GBM or random forest can come later if they turn out to be fast enough, but I do not expect the features in my data to be informative enough to justify investing a lot of time in training anything particularly complicated.

If need be, I could simplify the tuning by fixing one of glmnet's regularization parameters, e.g. running a pure LASSO.

Thanks in advance!

Here is what I have so far:

# Convert my data.table to a data.frame, since mlr expects one:
df <- as.data.frame(DT)

task <- makeClassifTask(id = "glmnet", 
                        data = df, 
                        target = "Flavour", 
                        positive = "quark")
task


lrn <- makeLearner("classif.glmnet", predict.type = "prob")
lrn

# Feature preprocessing -- want to do this as part of CV:
lrn <- makePreprocWrapperCaret(lrn,
                               ppc.center = TRUE, 
                               ppc.scale = TRUE,
                               ppc.YeoJohnson = TRUE)
lrn

# I want to use the implementation of info gain in CORElearn, not Weka:
infGain = makeFilter(
  name = "InfGain",
  desc = "Information gain",
  pkg  = "CORElearn",
  supported.tasks = c("classif", "regr"),
  supported.features = c("numerics", "factors"),
  fun = function(task, nselect, ...) {
    CORElearn::attrEval(
      getTaskFormula(task), 
      data = getTaskData(task), estimator = "InfGain", ...)
  }
)
infGain

# Take top 20 features:
lrn <-  makeFilterWrapper(lrn, fw.method = "InfGain", fw.abs = 20)
lrn

# Now things start to get foggy...

tuningLrn <- makeTuneWrapper(
  lrn, 
  resampling = makeResampleDesc("CV", iters = 2,  stratify = TRUE), 
  par.set = makeParamSet(
    makeNumericParam("s", lower = 0.001, upper = 0.1),
    makeNumericParam("alpha", lower = 0.0, upper = 1.0)
  ), 
  control = makeTuneControlGrid(resolution = 2)
)
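On the "or something faster" question: the tune wrapper accepts other controls as drop-in replacements for the grid control. A sketch assuming mlr's `makeTuneControlRandom` and `makeTuneControlIrace` (the latter needs the irace package installed):

```r
library(mlr)

# Random search over the same parameter set; maxit caps the number of
# hyperparameter configurations evaluated in the inner loop:
ctrlRandom <- makeTuneControlRandom(maxit = 50L)

# Iterated F-racing via the irace package, with a fixed experiment budget:
if (requireNamespace("irace", quietly = TRUE)) {
  ctrlIrace <- makeTuneControlIrace(maxExperiments = 200L)
}
```

Either object would then be passed as the `control` argument of `makeTuneWrapper` in place of the grid control.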

# Outer resampling loop for performance estimation (rdesc was missing above):
rdesc <- makeResampleDesc("CV", iters = 3, stratify = TRUE)

r2 <- resample(learner = tuningLrn,
               task = task,
               resampling = rdesc,
               measures = auc,
               extract = getTuneResult)  # keep inner tuning results per fold
# Now what...?
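As for "now what": the `resample()` result can be unpacked with mlr's nested-tuning accessors. A self-contained toy sketch on mlr's built-in `sonar.task`, with `classif.rpart` standing in for glmnet just to keep it light (the inspection calls at the end are what would apply to `r2` above):

```r
library(mlr)

# Toy nested CV: tune rpart's cp in a 2-fold inner loop, estimate
# performance in a 2-fold outer loop. Swap in your own task/learner.
lrn <- makeLearner("classif.rpart", predict.type = "prob")
tuningLrn <- makeTuneWrapper(
  lrn,
  resampling = makeResampleDesc("CV", iters = 2, stratify = TRUE),
  par.set = makeParamSet(
    makeNumericParam("cp", lower = 0.001, upper = 0.1)
  ),
  control = makeTuneControlGrid(resolution = 2)
)
outer <- makeResampleDesc("CV", iters = 2, stratify = TRUE)
r2 <- resample(tuningLrn, sonar.task, resampling = outer,
               measures = auc, extract = getTuneResult)

r2$aggr                            # outer-loop AUC, aggregated
r2$measures.test                   # AUC per outer fold
getNestedTuneResultsX(r2)          # best hyperparameters per outer fold
getNestedTuneResultsOptPathDf(r2)  # every inner-loop evaluation
```

Note that `getNestedTuneResultsX` and `getNestedTuneResultsOptPathDf` only work if `resample()` was called with `extract = getTuneResult`.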

Source: https://habr.com/ru/post/1652914/
