Can I use predict.glmnet on test data with a different number of predictor variables?

Question

I used glmnet to build a predictive model on a training set with ~200 predictors and 100 samples for a binomial regression / classification problem.

I chose the best model (16 predictors), the one that gave me the maximum AUC. I have an independent test set that contains only the variables (16 predictors) that made it into the final model from the training set.

Is there any way to use predict.glmnet with the optimal model from the training set on a new test set that contains data only for the variables that made it into the final model?

r glmnet

user1407875 Aug 24 '13 at 16:20

1 answer

NiuBiBang · Answer 1 · 2014-08-02T14:09:18+0000

glmnet requires the validation / test set to contain exactly the same number and names of variables as the training data set. For instance:

    library(caret)
    library(glmnet)

    df <- ... # a dataframe with 200 variables, some of which you want to predict on
              # & some of which you don't care about.
              # Variable 13 ('Response.Variable') is the dependent variable.
              # Variables 1-12 & 14-113 are the predictor variables
              # All training/testing & validation datasets are derived from this single df.

    # Split dataframe into training & testing sets
    inTrain <- createDataPartition(df$Response.Variable, p = .75, list = FALSE)
    Train <- df[ inTrain, ]  # Training dataset for all model development
    Test <- df[ -inTrain, ]  # Final sample for model validation

    # Run logistic regression, using only specified predictor variables
    logCV <- cv.glmnet(x = data.matrix(Train[, c(1:12,14:113)]), y = Train[,13],
                       family = 'binomial', type.measure = 'auc')

    # Test model over final test set, using specified predictor variables
    # Create field in dataset that contains predicted values
    Test$prob <- predict(logCV, type = "response",
                         newx = data.matrix(Test[, c(1:12,14:113)]),
                         s = 'lambda.min')
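
For the situation in the question, where the test set holds only the predictors that survived selection, one possible workaround (not something the code above does) is to pad the test matrix back out to the full set of training columns. The sketch below uses base R only; `pad_newx` and the toy column names are hypothetical, and it assumes every omitted predictor has a coefficient of exactly zero at the chosen lambda, which you can check with `coef(logCV, s = 'lambda.min')`.

```r
# Hypothetical helper (not part of glmnet): expand a reduced predictor
# matrix to the full set of training columns, in training order.
pad_newx <- function(newx, train_names, fill = 0) {
  newx <- as.matrix(newx)
  unknown <- setdiff(colnames(newx), train_names)
  if (length(unknown) > 0) {
    stop("Columns not seen in training: ", paste(unknown, collapse = ", "))
  }
  # Start from a matrix of filler values with one column per training predictor
  full <- matrix(fill, nrow = nrow(newx), ncol = length(train_names),
                 dimnames = list(rownames(newx), train_names))
  # Copy the columns we actually have into their named slots
  full[, colnames(newx)] <- newx
  full
}

# Toy data: pretend the model was trained on x1..x5 and the test set
# only carries x4 and x2 (in that order)
test_small <- cbind(x4 = c(1.5, 2.5), x2 = c(10, 20))
padded <- pad_newx(test_small, train_names = paste0("x", 1:5))
print(padded)
```

With a fitted model, the padded matrix would then go in as `newx`, e.g. `predict(logCV, newx = padded, s = 'lambda.min', type = "response")`, with `train_names` taken from the column names of the matrix used for fitting. Since the filler columns are multiplied by zero coefficients, they should not change the predictions; if any omitted predictor has a non-zero coefficient, this shortcut is not valid.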

For a completely new dataset, you can limit the new df to the necessary variables using some variant of the following method:

    new.df <- ... # new df w/ 1,000 variables, which include all predictor variables used
                  # in developing the model

    # Create object with requisite predictor variable names that we specified in the model
    predictvars <- c('PredictorVar1', 'PredictorVar2', 'PredictorVar3', ... 'PredictorVarK')

    new.df$prob <- predict(logCV, type = "response",
                           newx = data.matrix(new.df[, predictvars]),
                           s = 'lambda.min')
    # Indexing by name limits the new df of 1,000 variables to the requisite
    # predictors and keeps them in the order given in predictvars, which
    # should match the column order used when fitting the model.
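
As far as I know, `predict` on a glmnet fit with a matrix `newx` lines columns up by position rather than by name, so it can help to select the predictors by name in the model's order and to fail early if any are missing. A minimal base-R sketch of that check follows; `select_model_columns` and the toy data frame are hypothetical names for illustration only.

```r
# Hypothetical helper: pick the model's predictors out of a wider data frame,
# in the order the model expects, and stop if any are absent.
select_model_columns <- function(new_df, predictvars) {
  missing_vars <- setdiff(predictvars, names(new_df))
  if (length(missing_vars) > 0) {
    stop("Missing predictor(s): ", paste(missing_vars, collapse = ", "))
  }
  # drop = FALSE keeps a data frame even when only one predictor is requested
  data.matrix(new_df[, predictvars, drop = FALSE])
}

# Toy data frame with an extra column and a different column order
toy_df <- data.frame(extra = 1:3, b = c(4, 5, 6), a = c(7, 8, 9))
newx <- select_model_columns(toy_df, predictvars = c("a", "b"))
print(colnames(newx))
```

The returned matrix can be passed as `newx` to `predict`, provided `predictvars` lists the predictors in the same order as the columns of the training matrix.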

In addition, glmnet only works with matrices. This is probably why you get the error message that you posted in a comment on your question. Some users (including me) have found that as.matrix() does not solve the problem; data.matrix() seems to work though (which is why it is used in the code above). This issue is addressed in a thread or two on SO.
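
A likely reason for the difference, shown here as a small base-R illustration on a made-up data frame: when a data frame contains a factor or character column, `as.matrix()` returns a character matrix, whereas `data.matrix()` converts factors to their integer codes and returns a numeric matrix.

```r
# Made-up data frame with one numeric and one factor column
d <- data.frame(
  age   = c(31, 45, 52),
  group = factor(c("a", "b", "a"))
)

m_as   <- as.matrix(d)    # the factor column forces every cell to text
m_data <- data.matrix(d)  # the factor is replaced by its integer codes

print(is.character(m_as))   # TRUE
print(is.numeric(m_data))   # TRUE
print(m_data)
```

If all columns are already numeric, the two functions should give the same result, which may be why `as.matrix()` works for some datasets and not others.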

I assume that all variables in the new dataset used for prediction also need to be formatted in the same way as in the dataset used to develop the model. I usually pull all my data from a single source, so I have not come across what glmnet does when the formatting differs.
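
One concrete way the formatting can differ, sketched here in base R with made-up data: `data.matrix()` encodes a factor by its level codes, so the same label can map to a different number if the new data has a different set or order of levels. Re-declaring the factor with the training levels is one way to keep the encoding consistent; this is an illustration of the pitfall, not something I have tested against glmnet itself.

```r
# Made-up training data with an explicit level order
train <- data.frame(grade = factor(c("low", "mid", "high"),
                                   levels = c("low", "mid", "high")))
# New data only contains two of the labels; levels default to alphabetical order
new <- data.frame(grade = factor(c("mid", "high")))

# Naive conversion: codes follow the new data's own levels (high = 1, mid = 2)
naive <- data.matrix(new)[, "grade"]

# Re-declare the factor with the training levels (low = 1, mid = 2, high = 3)
new$grade <- factor(new$grade, levels = levels(train$grade))
fixed <- data.matrix(new)[, "grade"]

print(naive)
print(fixed)
```

In the naive version "high" is encoded as 1, while the training data encoded it as 3, so a model would silently see a different value for the same label.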
