Imputing Missing Values with Caret

I am working on the Kaggle Titanic competition and I have a question about imputing missing values. I am trying to use Caret, and my training set contains both factors and numeric columns.

I want to use the preProcess function in Caret to impute missing values, but before using preProcess I need to convert all my factors into dummy variables using the dummyVars function.

    dummies = dummyVars(survived ~ . -1, data = train, na.action = na.pass)
    xtrain = predict(dummies, train)

However, when dummyVars converts the factors, all NAs are imputed by some unknown algorithm, and the missing values in the age column become 1, even though I specified na.action = na.pass. I want to convert my factors into dummy variables WITHOUT the NAs being touched, so that I can then use the preProcess function to impute them. How can I do this?

Thanks.

dput here:

 structure(list(survived = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("0", "1"), class = "factor"), pclass = structure(c(3L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 2L, 3L, 3L ), .Label = c("1", "2", "3"), class = "factor"), sex = structure(c(2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("female", "male"), class = "factor"), age = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, NA, 31, NA), sibsp = c(1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0), parch = c(0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0), fare = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708, 16.7, 26.55, 8.05, 31.275, 7.8542, 16, 29.125, 13, 18, 7.225), embarked = structure(c(4L, 2L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 2L), .Label = c("", "C", "Q", "S"), class = "factor")), .Names = c("survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"), row.names = c(NA, 20L), class = "data.frame") 
1 answer

The first part is a bug; NA values should not become 1 (obviously). In the meantime, you can use model.matrix to create the dummy variables, but you may have to do this for all of the data at once. Alternatively, if you use train, you can use the formula method. Overall, that is the best approach.
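A minimal sketch of the model.matrix workaround, using base R only. The small data frame is made up for illustration (its column names follow the question), and the idea of building the model frame with na.action = na.pass so that rows with NAs are kept is an assumption about how one might apply the suggestion, not something spelled out above:

```r
# Illustrative data only: hypothetical values, column names follow the question.
train <- data.frame(
  survived = factor(c("0", "1", "1", "0", "1")),
  pclass   = factor(c("3", "1", "3", "2", "1")),
  sex      = factor(c("male", "female", "female", "male", "female")),
  age      = c(22, 38, NA, 35, NA),
  fare     = c(7.25, 71.28, 7.93, 8.05, 53.10)
)

# Keep only the predictors; the outcome is not dummy-coded.
predictors <- train[, setdiff(names(train), "survived")]

# na.pass keeps rows that contain missing values instead of dropping them.
# "- 1" removes the intercept, so the first factor gets one column per level.
mf <- model.frame(~ . - 1, data = predictors, na.action = na.pass)

# Build the dummy-coded matrix from the model frame; NAs stay NA.
xtrain <- model.matrix(attr(mf, "terms"), data = mf)

print(colnames(xtrain))
print(xtrain[, "age"])
```

The resulting matrix still has one row per observation, with NA where age was missing, so it can be handed to an imputation step afterwards, for example preProcess followed by predict from caret.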

I will fix this in the next few weeks. I'm about to release a new version of caret, and that, plus useR!, will delay me a bit.

EDIT: A new version that fixes the bug will be released next week.

Max


Source: https://habr.com/ru/post/947725/

