R caret variable selection: rfe() with factors and NAs

I have a data set with NAs.

It also has columns that should be factors.

I use the rfe() function from the caret package to select variables.

It seems that rfe() with functions = lmFuncs works for data with NAs but NOT for factor variables, while functions = rfFuncs works for factor variables but NOT for NAs.

Any suggestions on this?

I tried model.matrix(), but it just seemed to cause more problems.

1 answer

Because packages behave inconsistently on these points, not to mention the extra trickiness when moving to “meta” packages such as caret, I always find it easier to deal with NAs and factor variables up front, before I do any machine learning.

  • For NAs, either omit them or impute (median, knn, etc.).
  • For factor features, you are on the right track with model.matrix(). It lets you generate a series of "dummy" features for the different levels of the factor. Typical usage looks something like this:
    > dat = data.frame(x=factor(rep(1:3, each=5)))
    > dat$x
     [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
    Levels: 1 2 3
    > model.matrix(~ x - 1, data=dat)
       x1 x2 x3
    1   1  0  0
    2   1  0  0
    3   1  0  0
    4   1  0  0
    5   1  0  0
    6   0  1  0
    7   0  1  0
    8   0  1  0
    9   0  1  0
    10  0  1  0
    11  0  0  1
    12  0  0  1
    13  0  0  1
    14  0  0  1
    15  0  0  1
    attr(,"assign")
    [1] 1 1 1
    attr(,"contrasts")
    attr(,"contrasts")$x
    [1] "contr.treatment"
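
The two steps above can be combined into one small preprocessing pass. What follows is only a minimal sketch in base R, not a recommended pipeline: the data frame and its column names (num, grp) are made up for illustration, and median imputation stands in for whichever method suits your data. The NAs are handled first because model.matrix(), under R's default na.action, silently drops rows that contain NAs.

```r
# Toy data (hypothetical): one numeric column with NAs and one factor column.
dat <- data.frame(
  num = c(1.5, NA, 3.0, 4.5, NA, 6.0),
  grp = factor(c("a", "b", "a", "c", "b", "c"))
)

# Option 1: drop every row that contains an NA.
dat_complete <- na.omit(dat)

# Option 2: replace NAs in the numeric column with that column's median.
dat_imputed <- dat
dat_imputed$num[is.na(dat_imputed$num)] <- median(dat$num, na.rm = TRUE)

# Expand the factor into one 0/1 dummy column per level.
# "- 1" removes the intercept so that every level gets its own column.
x <- model.matrix(~ . - 1, data = dat_imputed)
```

The result x is a plain numeric matrix with no NAs and no factors, which is the kind of input that both lmFuncs and rfFuncs should be able to accept. Note that keeping a dummy column for every level makes the columns collinear with an intercept, so for a linear model you may prefer to leave out the "- 1" and drop the resulting "(Intercept)" column instead.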

Also, just in case you haven't seen them (although it sounds like you have), the caret vignettes on CRAN are very nice and touch on some of these points. http://cran.r-project.org/web/packages/caret/index.html


Source: https://habr.com/ru/post/1396078/