I have a dataset on which I use the model.matrix() function to convert factor variables into dummy variables. My data has 10 columns like this, each with three levels (2, 3, 4), and I created dummy variables for each of them separately.
xFormData <- function(dataset){
  mm0 <- model.matrix(~ factor(dataset$type), data = dataset)
  mm1 <- model.matrix(~ factor(dataset$type_last1), data = dataset)
  mm2 <- model.matrix(~ factor(dataset$type_last2), data = dataset)
  mm3 <- model.matrix(~ factor(dataset$type_last3), data = dataset)
  mm4 <- model.matrix(~ factor(dataset$type_last4), data = dataset)
  mm5 <- model.matrix(~ factor(dataset$type_last5), data = dataset)
  mm6 <- model.matrix(~ factor(dataset$type_last6), data = dataset)
  mm7 <- model.matrix(~ factor(dataset$type_last7), data = dataset)
  mm8 <- model.matrix(~ factor(dataset$type_last8), data = dataset)
  mm9 <- model.matrix(~ factor(dataset$type_last9), data = dataset)
  mm10 <- model.matrix(~ factor(dataset$type_last10), data = dataset)
  dataset <- cbind(dataset, mm0, mm1, mm2, mm3, mm4, mm5, mm6, mm7, mm8, mm9, mm10)
  dataset
}
I am wondering if this is the wrong procedure, because after running randomForest on the data and plotting the variable importance, it showed the different dummy columns of the same variable separately. For example, columns 61-63 are the 3 dummy variables for column 10, and randomForest sees column 62 by itself as an important predictor.
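For illustration, here is a minimal sketch on a made-up data frame (`toy` is hypothetical, not the real dataset) of what one of these model.matrix() calls produces. Assuming R's default treatment contrasts, a three-level factor yields an intercept column plus one indicator column per non-reference level, all of them plain numeric columns, which is presumably why they show up as separate predictors.

```r
# Hypothetical toy data: one column with the three levels from the question.
toy <- data.frame(type = c(2, 3, 4, 2, 3))

# Same call pattern as in xFormData(), applied to a single column.
mm <- model.matrix(~ factor(type), data = toy)

# The result is an ordinary numeric matrix: an intercept column plus one
# indicator column for each non-reference level (default treatment contrasts).
print(colnames(mm))
print(dim(mm))
```

Nothing in the resulting matrix records that these columns came from the same factor, so after cbind() they are indistinguishable from any other numeric column.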
I have 2 questions:
1) Is this normal?
2) If not, how can I group dummy variables so that rf knows that they are together?