R's randomForest for linear regression: flat tails and mtry

I use the randomForest package in R (R version 2.13.1, randomForest version 4.6-2) for regression and noticed a significant bias in my results: the prediction error depends on the value of the response variable. High values are under-predicted and low values are over-predicted. At first I suspected that this was a consequence of my data, but the following simple example shows that it is inherent to the random forest algorithm:

    library(randomForest)

    n = 50
    x1 = seq(1, n)
    x2 = matrix(1, n, 1)                       # a constant, uninformative predictor
    predictors = data.frame(x1 = x1, x2 = x2)
    response = x2 + x1                         # perfectly linear response
    rf = randomForest(x = predictors, y = response)
    plot(x1, response)
    lines(x1, predict(rf, predictors), col = "red")

No doubt tree-based methods have their limitations when it comes to linearity, but even the simplest regression tree, e.g. tree() in R, does not show this bias. I can't imagine that the community would be unaware of this, but I could not find any mention of how it is generally fixed. Thanks for any comments.
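For comparison, here is a minimal sketch of the same check with a single regression tree via tree() (it assumes the x1 and response objects from the example above):

    library(tree)

    # Fit one regression tree on the same toy data; tree() considers every
    # predictor at every split, so it always finds the informative x1.
    df = data.frame(x1 = x1, y = as.numeric(response))
    single_tree = tree(y ~ x1, data = df)
    plot(x1, df$y)
    lines(x1, predict(single_tree, df), col = "blue")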

EDIT: The example in this question is flawed; see "RandomForest for Regression in R - Response Distribution Dependent Bias" on Cross Validated for a better treatment: https://stats.stackexchange.com/questions/28732/randomforest-for-regression-in-r-response-distribution-dependent-bias

+6
1 answer

What you discovered is not an inherent bias in random forests, but simply a failure to properly tune the parameters of the model.

Using the example data:

    # mtry = 2: consider both predictors at every split;
    # nodesize = 1: grow the trees down to single-observation leaves.
    rf = randomForest(x = predictors, y = response, mtry = 2, nodesize = 1)
    plot(x1, response)
    lines(x1, predict(rf, predictors), col = "red")

[Plot: with mtry = 2 and nodesize = 1, the red prediction line tracks the linear response closely.]

For your real data the improvement is unlikely to be so dramatic, of course, and I'd bet you will get more mileage out of nodesize than out of mtry (mtry did most of the work here).
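If you want to check that on your own data, one option is to compare the out-of-bag error across a few nodesize values. A minimal sketch, assuming the predictors and response objects from the question (rf$mse is the OOB mean squared error after each successive tree):

    # Compare OOB error over a small grid of nodesize values.
    for (ns in c(1, 5, 10, 25)) {
      fit = randomForest(x = predictors, y = as.numeric(response), nodesize = ns)
      cat("nodesize =", ns, " OOB MSE =", tail(fit$mse, 1), "\n")
    }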

The reason ordinary trees did not exhibit this "bias" is that by default they search over all variables for the best split.
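randomForest, by contrast, defaults to mtry = max(floor(p/3), 1) for regression, which is 1 for the two predictors here: each split examines a single randomly sampled predictor, and half the time that is the constant x2, which offers no useful split. A quick way to confirm the default (assuming the question's objects):

    # The fitted object stores the mtry value that was actually used.
    rf_default = randomForest(x = predictors, y = as.numeric(response))
    rf_default$mtry   # 1, i.e. max(floor(2/3), 1)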

+5

Source: https://habr.com/ru/post/915244/

