I use the randomForest package in R (R version 2.13.1, randomForest version 4.6-2) for regression and noticed a significant bias in my results: the prediction error depends on the value of the response variable. High values are underpredicted and low values are overpredicted. At first I suspected this was a consequence of my data, but the following simple example shows that it is inherent to the random forest algorithm:
library(randomForest)

n = 50
x1 = seq(1, n)
x2 = matrix(1, n, 1)
predictors = data.frame(x1 = x1, x2 = x2)
response = as.vector(x2 + x1)  # as.vector so y is a numeric vector, not an n x 1 matrix
rf = randomForest(x = predictors, y = response)
plot(x1, response)
lines(x1, predict(rf, predictors), col = "red")
No doubt tree-based methods have their limitations when it comes to linearity, but even the simplest regression tree, e.g. tree() in R, does not show this bias. I can't imagine that the community is unaware of this, but I could not find any mention of how it is generally fixed. Thanks for any comments.
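For comparison, here is a minimal sketch of the single-tree case. It uses the rpart package as a stand-in for tree() (an assumption on my part; tree() behaves similarly), with the complexity controls loosened so the tree can grow deep enough to follow the linear trend all the way to the extremes:

```
library(rpart)

# Same kind of linear toy data as above
n = 50
x1 = seq(1, n)
dat = data.frame(x1 = x1, response = x1 + 1)

# A single regression tree; cp = 0 and minsplit = 2 let it grow deep
tree_fit = rpart(response ~ x1, data = dat,
                 control = rpart.control(cp = 0, minsplit = 2))

plot(dat$x1, dat$response)
lines(dat$x1, predict(tree_fit, dat), col = "blue")
```

Unlike the forest, the single tree's fitted line tracks the response at both ends rather than pulling the extreme predictions toward the mean.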
EDIT: The example in this question is flawed; see "RandomForest for regression in R - response distribution dependent bias" on Cross Validated for a better treatment: https://stats.stackexchange.com/questions/28732/randomforest-for-regression-in-r-response-distribution-dependent-bias