I am trying to work around the randomForest package's 32-level limit for factors.
I have a data set with 100 levels in one of the factor variables.
I wrote the following code to see how this behaves when sampling with replacement, and how many draws it takes to cover a given percentage of the levels.
    sampAll <- c()
    nums1 <- seq(1,102,1)
    for(i in 1:20){
      samp1 <- sample(nums1, 32)
      sampAll <- unique(cbind(sampAll, samp1))
      outSamp1 <- nums1[-(sampAll[,1:ncol(sampAll)])]
      print(paste(i, " | Remaining: ",length(outSamp1)/102,sep=""))
      flush.console()
    }

    [1] "1 | Remaining: 0.686274509803922"
    [1] "2 | Remaining: 0.490196078431373"
    [1] "3 | Remaining: 0.333333333333333"
    [1] "4 | Remaining: 0.254901960784314"
    [1] "5 | Remaining: 0.215686274509804"
    [1] "6 | Remaining: 0.147058823529412"
    [1] "7 | Remaining: 0.117647058823529"
    [1] "8 | Remaining: 0.0980392156862745"
    [1] "9 | Remaining: 0.0784313725490196"
    [1] "10 | Remaining: 0.0784313725490196"
    [1] "11 | Remaining: 0.0490196078431373"
    [1] "12 | Remaining: 0.0294117647058824"
    [1] "13 | Remaining: 0.0196078431372549"
    [1] "14 | Remaining: 0.00980392156862745"
    [1] "15 | Remaining: 0.00980392156862745"
    [1] "16 | Remaining: 0.00980392156862745"
    [1] "17 | Remaining: 0.00980392156862745"
    [1] "18 | Remaining: 0"
    [1] "19 | Remaining: 0"
    [1] "20 | Remaining: 0"
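As a rough cross-check on the simulation, the expected fraction of levels still unseen can be worked out directly. Assuming each draw picks 32 of the 102 values uniformly and the draws are independent, a given level is missed by one draw with probability 70/102, so after k draws the expected remaining fraction is (70/102)^k. The variable names below are mine, not from the code above.

```r
# Expected fraction of levels not yet drawn after k independent draws,
# assuming each draw takes 32 of 102 levels uniformly at random.
n_levels <- 102
n_draw   <- 32
k        <- 1:20

expected_remaining <- (1 - n_draw / n_levels)^k

# Compare against the simulated "Remaining" column.
print(data.frame(draw = k, expected = round(expected_remaining, 4)))
```

This gives only the average behavior; a single run, like the one printed above, will wander around these values.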
What I am debating is whether to sample with or without replacement.
I am thinking about:
- drawing a sample of 32 of the 100 factor levels,
- using the rows with those levels to fit randomForest,
- predicting on the test set with that randomForest, and
- repeating this process either (a) 3 times (without replacement) or (b) 10-15 times (with replacement),
- taking the 3 or 10-15 predicted values, averaging them, and using the average as the final prediction.
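To make the steps above concrete, here is a minimal sketch of variant (b) on made-up data. Everything in it is an assumption for illustration: the toy data frame, the column names `f`, `x`, `y`, the regression setting, and in particular the choice to skip test rows whose level is not in the current draw (they get `NA` for that round and are averaged over the rounds that did cover them). It is one possible reading of the idea, not a recommended procedure.

```r
library(randomForest)

set.seed(1)

# Hypothetical toy data: a 100-level factor f, a numeric x, a numeric response y.
n   <- 2000
lev <- paste0("lev", 1:100)
dat <- data.frame(
  f = factor(sample(lev, n, replace = TRUE), levels = lev),
  x = rnorm(n)
)
dat$y <- as.integer(dat$f) / 10 + dat$x + rnorm(n)
train <- dat[1:1500, ]
test  <- dat[1501:n, ]

n_rounds <- 10
preds <- matrix(NA_real_, nrow = nrow(test), ncol = n_rounds)

for (r in seq_len(n_rounds)) {
  keep <- sample(lev, 32)                      # 32 levels for this round

  tr   <- train[train$f %in% keep, ]           # training rows with those levels
  tr$f <- factor(tr$f, levels = keep)          # factor now has exactly 32 levels
  fit  <- randomForest(y ~ f + x, data = tr, ntree = 100)

  # Assumption: only test rows whose level was drawn are predicted this round.
  idx  <- which(test$f %in% keep)
  te   <- test[idx, ]
  te$f <- factor(te$f, levels = keep)          # same levels as in training
  preds[idx, r] <- predict(fit, newdata = te)
}

# Average over the rounds that produced a prediction for each test row.
# Rows whose level was never drawn end up as NaN.
final_pred <- rowMeans(preds, na.rm = TRUE)
```

Note that with 10 rounds some levels may never be drawn, so a few test rows can be left without any prediction; how to handle those is an open part of the question.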
I am wondering whether anyone has tried something like this, whether I am breaking any rules (introducing bias, etc.), or whether anyone has suggestions.
Note: I cross-posted this question on Stats-Overflow / Cross-Validated.