Because packages handle these cases inconsistently, not to mention the added trickiness when you move to "meta-packages" such as caret, I always find it easier to deal with NAs and factor variables before doing any machine learning.
- For NAs, either omit them or impute them (median, kNN, etc.).
- For factor variables, you are on the right track with model.matrix(). It lets you generate a set of "dummy" variables, one for each factor level. A typical use looks something like this:
    > dat = data.frame(x=factor(rep(1:3, each=5)))
    > dat$x
     [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
    Levels: 1 2 3
    > model.matrix(~ x - 1, data=dat)
       x1 x2 x3
    1   1  0  0
    2   1  0  0
    3   1  0  0
    4   1  0  0
    5   1  0  0
    6   0  1  0
    7   0  1  0
    8   0  1  0
    9   0  1  0
    10  0  1  0
    11  0  0  1
    12  0  0  1
    13  0  0  1
    14  0  0  1
    15  0  0  1
    attr(,"assign")
    [1] 1 1 1
    attr(,"contrasts")
    attr(,"contrasts")$x
    [1] "contr.treatment"
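To make both steps concrete, here is a minimal base-R sketch that handles the NAs first and then builds the dummy columns. The data frame and its column names (age, group) are made up for illustration, and median imputation is only one of the options mentioned above, not a recommendation for any particular data set.

```r
# Hypothetical toy data: one numeric column with NAs, one factor column.
dat <- data.frame(
  age   = c(23, NA, 31, 40, NA, 28),
  group = factor(c("a", "b", "a", "c", "b", "c"))
)

# Option 1: drop every row that contains an NA.
dat_complete <- na.omit(dat)

# Option 2: replace NAs in the numeric column with that column's median.
dat_imputed <- dat
med <- median(dat_imputed$age, na.rm = TRUE)
dat_imputed$age[is.na(dat_imputed$age)] <- med

# Expand the factor into one 0/1 column per level ("- 1" drops the intercept).
dummies <- model.matrix(~ group - 1, data = dat_imputed)

# Combine the numeric column and the dummies into a purely numeric matrix.
X <- cbind(age = dat_imputed$age, dummies)
```

Doing the NA step first matters here: with the default na.action, model.matrix() drops rows that have an NA in any variable used in the formula, so the result can end up with fewer rows than the original data.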
Also, in case you haven't seen them (although it looks like you have), the caret vignettes on CRAN are very good and touch on some of these points: http://cran.r-project.org/web/packages/caret/index.html