Cv.glmnet does not work for ridge, not for lasso, for modeling data with encoder error

Gist

Error: Error in predmat[which, seq(nlami)] = preds : replacement has length zero

Context: data is modeled with binary y, but there are n true y encoders. data is added n times, and the model is installed, trying to get true y .

Error received for

  • L2 fine, but not L1 fine.
  • when Y is the encoder of Y, but not when it is true Y.
  • the error is not determined, but depends on the seed.

UPDATE: error for versions after 1.9-8. 1.9.8 is not interrupted.

Play

basic data:

 library(glmnet) rm(list=ls()) set.seed(123) num_obs=4000 n_coders=2 precision=.8 X <- matrix(rnorm(num_obs*20, sd=1), nrow=num_obs) prob1 <- plogis(X %*% c(2, -2, 1, -1, rep(0, 16))) # yes many zeros, ignore y_true <- rbinom(num_obs, 1, prob1) dat <- data.frame(y_true = y_true, X = X) 

create encoders

 classify <- function(true_y,precision){ n=length(true_y) y_coder <- numeric(n) y_coder[which(true_y==1)] <- rbinom(n=length(which(true_y==1)), size=1,prob=precision) y_coder[which(true_y==0)] <- rbinom(n=length(which(true_y==0)), size=1,prob=(1-precision)) return(y_coder) } y_codings <- sapply(rep(precision,n_coders),classify,true_y = dat$y_true) 

collect everything

 expanded_data <- do.call(rbind,rep(list(dat),n_coders)) expanded_data$y_codings <- matrix(y_codings, ncol = 1) 

reproduce the error

Since the error depends on the seed, a loop is needed. only the first cycle will fail, the other two will be completed.

 X <- as.matrix(expanded_data[,grep("X",names(expanded_data))]) for (i in 1:1000) cv.glmnet(x = X,y = expanded_data$y_codings, family="binomial", alpha=0) # will fail for (i in 1:1000) cv.glmnet(x = X,y = expanded_data$y_codings, family="binomial", alpha=1) # will not fail for (i in 1:1000) cv.glmnet(x = X,y = expanded_data$y_true, family="binomial", alpha=0) # will not fail 

Any thoughts where this comes from in glmnet and how to avoid it? from my reading of cv.glmnet , this is after the cv procedure and is inside cvstuff = do.call(fun, list(outlist, lambda, x, y, weights, offset, foldid, type.measure, grouped, keep)) , which I do not understand its role, therefore, failure and how to avoid it.

Sessions (Ubuntu and PC)

 R version 3.3.1 (2016-06-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.1 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] glmnet_2.0-2 foreach_1.4.3 Matrix_1.2-7.1 devtools_1.12.0 loaded via a namespace (and not attached): [1] httr_1.2.1 R6_2.2.0 tools_3.3.1 withr_1.0.2 curl_2.1 [6] memoise_1.0.0 codetools_0.2-15 grid_3.3.1 iterators_1.0.8 knitr_1.14 [11] digest_0.6.10 lattice_0.20-34 

and

 R version 3.3.1 (2016-06-21) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] glmnet_2.0-2 foreach_1.4.3 Matrix_1.2-7.1 devtools_1.12.0 loaded via a namespace (and not attached): [1] httr_1.2.1 R6_2.2.0 tools_3.3.1 withr_1.0.2 curl_2.1 [6] memoise_1.0.0 codetools_0.2-15 grid_3.3.1 iterators_1.0.8 digest_0.6.10 [11] lattice_0.20-34 
+6
source share
2 answers

I had the same error in glmnet_2.0-5 This is somehow related to the way lambdas are automatically created in some situations. The solution is to provide your own lambdas

eg:

 cv.glmnet(x = X, y = expanded_data$y_codings, family="binomial", alpha=0, lambda=exp(seq(log(0.001), log(5), length.out=100))) 

Kudos https://github.com/lmweber/glmnet-error-example/blob/master/glmnet_error_example.R

+3
source

Well, I just started the first cycle and it completed successfully. This is with glmnet 2.0.2.

This is more of a comment, but too big to fit: when you run tests that depend on random numbers, you can save the seed as you go. This allows you to go to the middle of testing without returning to launch every time.

Something like that:

 results <- lapply(1:1000, function(x) { seed <- .Random.seed res <- try(glmnet(x, y, ...)) # so the code keeps running even if there an error attr(res, "seed") <- seed res }) 

Now you can check if any of the runs passed by looking at the result class:

 errs <- sapply(results, function(x) inherits(x, "try-error")) any(errs) 

And you can repeat those runs that failed:

 firstErr <- which(errs)[1] .Random.seed <- attr(results[[firstErr]], "seed") glmnet(x, y, ...) # try failed run again 

Session Information:

 R version 3.2.2 (2015-08-14) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 8 x64 (build 9200) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.850 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] glmnetUtils_0.55 RevoUtilsMath_8.0.3 RevoUtils_8.0.3 RevoMods_8.0.3 RevoScaleR_8.0.6 [6] lattice_0.20-33 rpart_4.1-10 loaded via a namespace (and not attached): [1] Matrix_1.2-2 parallel_3.2.2 codetools_0.2-14 rtvs_1.0.0.0 grid_3.2.2 [6] iterators_1.0.8 foreach_1.4.3 glmnet_2.0-2 

(It should be Windows 10, not 8; R 3.2.2 does not know about Win10)

+2
source

Source: https://habr.com/ru/post/1011496/


All Articles