Bypassing the 32-level factor limit of the randomForest package in R

I am trying to work around the 32-level limit on factor variables in the randomForest package.

I have a data set with 100 levels in one of the factor variables.

I wrote the following code to see how things behave: sampling 32 levels at a time (with replacement across draws) and counting how many draws it takes to cover a given percentage of the levels.

sampAll <- c()
nums1 <- seq(1, 102, 1)                # 102 candidate levels here (the data set itself has 100)
for (i in 1:20) {
  samp1 <- sample(nums1, 32)           # draw 32 levels
  sampAll <- unique(cbind(sampAll, samp1))
  outSamp1 <- nums1[-(sampAll[, 1:ncol(sampAll)])]   # levels never drawn so far
  print(paste(i, " | Remaining: ", length(outSamp1) / 102, sep = ""))
  flush.console()
}

[1] "1 | Remaining: 0.686274509803922"
[1] "2 | Remaining: 0.490196078431373"
[1] "3 | Remaining: 0.333333333333333"
[1] "4 | Remaining: 0.254901960784314"
[1] "5 | Remaining: 0.215686274509804"
[1] "6 | Remaining: 0.147058823529412"
[1] "7 | Remaining: 0.117647058823529"
[1] "8 | Remaining: 0.0980392156862745"
[1] "9 | Remaining: 0.0784313725490196"
[1] "10 | Remaining: 0.0784313725490196"
[1] "11 | Remaining: 0.0490196078431373"
[1] "12 | Remaining: 0.0294117647058824"
[1] "13 | Remaining: 0.0196078431372549"
[1] "14 | Remaining: 0.00980392156862745"
[1] "15 | Remaining: 0.00980392156862745"
[1] "16 | Remaining: 0.00980392156862745"
[1] "17 | Remaining: 0.00980392156862745"
[1] "18 | Remaining: 0"
[1] "19 | Remaining: 0"
[1] "20 | Remaining: 0"

What I am debating is whether to sample with or without replacement.

Here is what I am thinking of doing (a sketch follows the list):

  • draw a sample of 32 of the 100 factor levels,
  • run randomForest on the rows belonging to those 32 levels,
  • predict on the test set with that randomForest model, and
  • repeat the process either (a) 3 times (WITHOUT replacement) or (b) 10-15 times (with replacement),
  • then take the 3 or 10-15 predicted values, average them, and use that average as the final prediction.
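
For illustration, a minimal sketch of that loop. Everything in it is hypothetical: it assumes a training frame train and a test frame test, each with the 100-level factor f and a numeric response y. Note that a test row only receives predictions in the repeats where its level happens to be drawn.

library(randomForest)

set.seed(1)
nReps <- 10                                    # 10-15 repeats, "with replacement" across repeats
preds <- matrix(NA_real_, nrow(test), nReps)
for (r in 1:nReps) {
  lv  <- sample(levels(train$f), 32)           # draw 32 of the 100 levels
  tr  <- droplevels(train[train$f %in% lv, ])  # training rows belonging to those levels
  fit <- randomForest(y ~ ., data = tr)
  sel <- test$f %in% levels(tr$f)              # test rows whose level the model has seen
  te  <- test[sel, , drop = FALSE]
  te$f <- factor(te$f, levels = levels(tr$f))  # align factor levels for predict()
  preds[sel, r] <- predict(fit, te)
}
finalPred <- rowMeans(preds, na.rm = TRUE)     # average over the repeats that scored each row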

I am wondering whether anyone has tried something like this, whether I am breaking any rules (introducing bias, etc.), or whether anyone has suggestions.

Note: I cross-posted this question on Cross Validated (the statistics Stack Exchange).

2 answers

I can recommend two approaches:

  • You can convert the 100-level variable into 100 binary variables, each representing one original level (0 = false, 1 = true). That way you can work with the entire data set and build a random forest model on it. In this case, however, the memory consumed by your data set will grow, and you may need additional packages for working with very large data sets. (A sketch of this encoding follows the list.)

  • The second option is to take many samples of the original data set with replacement, because if you split the data set without replacement you will introduce bias into the model. Even so, I think you will need more than 10-15 splits to avoid that bias; I cannot say exactly how many, perhaps a few hundred or more, it depends on your data set. If the number of observations in each of the 100 levels differs significantly, the splits will yield samples of significantly different sizes, and this can hurt the predictive ability of the model; in that case the number of splits should be increased further. (A quick imbalance check is also sketched below.)
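
For the first approach, a minimal sketch of the encoding, assuming a hypothetical data frame train whose 100-level factor is named f:

# One 0/1 indicator column per level of `f`; the `- 1` drops the
# intercept so no level is absorbed into a baseline column.
bin    <- as.data.frame(model.matrix(~ f - 1, data = train))
train2 <- cbind(train[setdiff(names(train), "f")], bin)
# train2 now carries 100 binary columns in place of one 100-level factor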
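
For the second approach, a quick way to gauge how unbalanced the levels are (same hypothetical train and f); very uneven counts are an argument for using more splits:

cnt <- table(train$f)        # number of rows per level
summary(as.numeric(cnt))     # min / median / max level sizes at a glance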


You can also split your 100-level variable into 4 separate variables, each with 25 levels. This would create complicated aliasing problems with a linear model, but you do not need to worry about that with a random forest. One way the split might look is sketched below.
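
A sketch of one possible split, again assuming a hypothetical frame train with the 100-level factor f. Each derived factor keeps 25 of the original levels and lumps everything else into an "other" level, so each stays under the 32-level limit:

lev <- levels(train$f)
grp <- rep(1:4, each = 25)                 # assign each of the 100 levels to one of 4 groups
for (g in 1:4) {
  keep <- lev[grp == g]
  train[[paste0("f", g)]] <- factor(
    ifelse(train$f %in% keep, as.character(train$f), "other"))
}
# f1..f4 each have at most 26 levels (25 kept + "other")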


Source: https://habr.com/ru/post/1389849/

