Setting up R model.matrix

I have a dataset on which I use the model.matrix() function to convert factor variables into dummy variables. My data has 10 such columns, each with three levels (2, 3, 4), and I created dummy variables for each of them separately.

    xFormData <- function(dataset){
      mm0 <- model.matrix(~ factor(dataset$type), data = dataset)
      mm1 <- model.matrix(~ factor(dataset$type_last1), data = dataset)
      mm2 <- model.matrix(~ factor(dataset$type_last2), data = dataset)
      mm3 <- model.matrix(~ factor(dataset$type_last3), data = dataset)
      mm4 <- model.matrix(~ factor(dataset$type_last4), data = dataset)
      mm5 <- model.matrix(~ factor(dataset$type_last5), data = dataset)
      mm6 <- model.matrix(~ factor(dataset$type_last6), data = dataset)
      mm7 <- model.matrix(~ factor(dataset$type_last7), data = dataset)
      mm8 <- model.matrix(~ factor(dataset$type_last8), data = dataset)
      mm9 <- model.matrix(~ factor(dataset$type_last9), data = dataset)
      mm10 <- model.matrix(~ factor(dataset$type_last10), data = dataset)
      dataset <- cbind(dataset, mm0, mm1, mm2, mm3, mm4, mm5, mm6, mm7, mm8, mm9, mm10)
      dataset
    }

I am wondering if this is the wrong procedure, because after running randomForest on the data and plotting the variable importance, it showed the dummy columns of each variable separately. For example, columns 61-63 are the 3 dummy variables for column 10, and randomForest sees column 62 by itself as an important predictor.

I have 2 questions:

1) Is this normal?

2) If not, how can I group dummy variables so that rf knows that they are together?

1 answer

This is normal, and for many modelling functions the same expansion still happens behind the scenes if you leave factors as factors. Different factor levels are different features for most machine learning purposes. Think of a random example, such as test outcome ~ school: maybe school A is very predictive of whether you pass the test or not, but school B and school C are not. Then the feature for school A will be useful, but the others will not.
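To make the school example concrete, here is a minimal sketch of how a single factor expands into one indicator column per level. The data frame `toy` and its columns `school` and `passed` are made up for illustration and are not part of the original question.

```r
# Hypothetical toy data: 'school' and 'passed' are invented for illustration.
toy <- data.frame(
  school = factor(c("A", "B", "C", "A", "B", "C")),
  passed = c(1, 0, 0, 1, 1, 0)
)

# With the default treatment contrasts, the first level ("A") is the
# baseline and each remaining level gets its own indicator column.
mm_default <- model.matrix(~ school, data = toy)
colnames(mm_default)
# "(Intercept)" "schoolB" "schoolC"

# Removing the intercept yields one indicator column for every level,
# so "school A" shows up as a feature of its own.
mm_full <- model.matrix(~ school - 1, data = toy)
colnames(mm_full)
# "schoolA" "schoolB" "schoolC"
```

Each of these columns is a separate feature as far as the model is concerned, which is why one level can rank as important while its siblings do not.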

This is described in one of the caret vignette documents: http://cran.r-project.org/web/packages/caret/vignettes/caretMisc.pdf

In addition, the cars dataset included in caret should be a useful example. It contains 2 factors, "manufacturer" and "type of car", which have been dummy-encoded into a series of numeric features for machine learning purposes.

    data(cars, package='caret')
    head(cars)
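As a related sketch, the separate model.matrix() calls in the question could be collapsed into a single call by converting the columns to factors first. This is only one possible way to write it, not the asker's method: the data below is made up, only two of the columns are shown, and dropping the intercept column is an assumption about what is wanted.

```r
# Hypothetical data with two 3-level columns named like those in the question.
dataset <- data.frame(
  type       = c(2, 3, 4, 2, 3, 4),
  type_last1 = c(4, 4, 2, 3, 2, 3)
)

# Convert the selected columns to factors once, up front.
cols <- c("type", "type_last1")
dataset[cols] <- lapply(dataset[cols], factor)

# One model.matrix() call expands every factor in the formula.
# The first column is the intercept, so it is dropped here.
dummies <- model.matrix(~ type + type_last1, data = dataset)[, -1, drop = FALSE]
colnames(dummies)
# "type3" "type4" "type_last13" "type_last14"

# Attach the indicator columns to the original data.
dataset <- cbind(dataset, dummies)
```

The column names are the variable name followed by the level, so "type_last13" means level 3 of type_last1. Level 2 is the baseline for each factor and has no column of its own.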

Source: https://habr.com/ru/post/1396076/

