How to create a prediction interval from the rpart object of a regression tree?

How do you create a prediction interval from a regression tree fit with rpart?

I understand that a regression tree models the response as the mean value in each leaf node. I don't know how to get the variance for a leaf node from the model, but what I would like to do is simulate using the mean and variance of the leaf node to get a prediction interval.

predict.rpart() does not provide an option for an interval.

Example: I fit a tree to the iris data, but predict has no "interval" option:

    > r1 <- rpart(Sepal.Length ~ ., cp = 0.001, data = iris[1:nrow(iris)-1,])
    > predict(r1,newdata=iris[nrow(iris),],type = "interval")
    Error in match.arg(type) :
      'arg' should be one of "vector", "prob", "class", "matrix"
2 answers

It is not clear to me what confidence intervals would mean for regression trees, since they are not classical statistical models like linear models. I see basically two uses: characterizing the certainty of your tree, or characterizing the accuracy of the prediction for each leaf of the tree. Below is an answer for each of these possibilities.

Characterizing the certainty of your tree

If you are looking for a confidence value for a split node, then party provides this directly, because it uses permutation tests to determine statistically which variables are most important, and it reports the p-value associated with each split. This is a significant advantage of party's ctree function over rpart, as explained here.

Confidence Intervals for a Set of Regression Tree Leaves

Second, if you are looking for a confidence interval for the value in each leaf, then the [0.025, 0.975] quantile interval of the observations in that leaf is most likely what you want. By default, party uses a similar approach when it displays boxplots of the response for each leaf:

    library("party")
    r2 <- ctree(Sepal.Length ~ ., data=iris)
    plot(r2)

(Figure: example party tree)

The corresponding intervals can be obtained simply:

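    # leaf (terminal node) id for every training observation
    iris$leaf <- predict(r2, type="node")
    # per-leaf quantiles of the response
    CIleaf <- aggregate(iris$Sepal.Length, by=list(leaf=iris$leaf),
                        quantile, probs=c(0.025, 0.25, 0.75, 0.975))

They are also easy to visualize: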

    plot(as.factor(CIleaf$leaf), CIleaf[, 2],
         ylab="Sepal length", xlab="Regression tree leaf")
    legend("bottomright",
           c(" 0.975 quantile", " 0.75 quantile", " mean",
             " 0.25 quantile", " 0.025 quantile"),
           pch=c("-", "_", "_", "_", "-"),
           pt.lwd=0.5, pt.cex=c(1, 1, 2, 1, 1), xjust=1)

(Figure: Sepal length variance per regression tree leaf)
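The same per-leaf idea can be sketched for the rpart fit from the question. This is only a minimal sketch, not something predict.rpart offers: it assumes the `where` component of the fitted rpart object (the row of `r1$frame` holding each training observation's leaf) and finds the new observation's leaf by matching its predicted value against the leaf values in `r1$frame$yval`, which assumes no two leaves share exactly the same mean. The names trainData, newObs, leafId, newLeaf and the two interval variables are made up for illustration, and the normal-based interval additionally assumes roughly normal responses within a leaf.

```r
library(rpart)

trainData <- iris[-150L, ]
newObs    <- iris[150L, ]
r1 <- rpart(Sepal.Length ~ ., data = trainData, cp = 0.001)

# For every training row: row index in r1$frame of the leaf it falls into
leafId <- r1$where
y      <- trainData$Sepal.Length

# Per-leaf summaries of the observed response
leafMean <- tapply(y, leafId, mean)
leafSd   <- tapply(y, leafId, sd)
leafLow  <- tapply(y, leafId, quantile, probs = 0.025)
leafHigh <- tapply(y, leafId, quantile, probs = 0.975)

# Locate the leaf of the new observation by matching its prediction
# to the fitted leaf values (assumes distinct leaf means)
pred      <- unname(predict(r1, newdata = newObs))
frameYval <- r1$frame$yval[as.integer(names(leafMean))]
newLeaf   <- names(leafMean)[which.min(abs(frameYval - pred))]

# Empirical 95% interval from the observations in that leaf
empiricalInterval <- unname(c(leafLow[newLeaf], leafHigh[newLeaf]))

# Normal approximation using the leaf mean and standard deviation
normalInterval <- unname(leafMean[newLeaf] +
                         c(-1, 1) * qnorm(0.975) * leafSd[newLeaf])

empiricalInterval
normalInterval
```

Both intervals describe the spread of the training responses inside one leaf; they ignore the uncertainty in the tree structure itself.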


Perhaps one option is to simply bootstrap your training data?

    library(rpart)
    library(boot)

    trainData <- iris[-150L, ]
    predictData <- iris[150L, ]

    rboot <- boot(trainData, function(data, idx) {
      bootstrapData <- data[idx, ]
      r1 <- rpart(Sepal.Length ~ ., bootstrapData, cp = 0.001)
      predict(r1, newdata = predictData)
    }, 1000L)

    quantile(rboot$t, c(0.025, 0.975))
        2.5%    97.5%
    5.871393 6.766842
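Note that the percentiles of bootstrapped predictions describe how much the fitted leaf value varies between resamples, which is closer to a confidence interval for the predicted mean than to a prediction interval for a new observation. One possible way to widen it, sketched here under that assumption and not taken from the answer above, is to add a randomly drawn in-sample residual from each bootstrap fit to that fit's prediction. The names sims, bootData and predInterval are made up for illustration, and a plain replicate() loop is used instead of boot() to keep the sketch short.

```r
library(rpart)

set.seed(1)  # only to make the sketch reproducible
trainData   <- iris[-150L, ]
predictData <- iris[150L, ]

sims <- replicate(500L, {
  # Resample the training rows with replacement and refit the tree
  idx      <- sample(nrow(trainData), replace = TRUE)
  bootData <- trainData[idx, ]
  fit      <- rpart(Sepal.Length ~ ., data = bootData, cp = 0.001)

  # In-sample residuals of this bootstrap fit
  res <- bootData$Sepal.Length - predict(fit, newdata = bootData)

  # Predicted leaf mean for the new row plus one randomly drawn residual
  unname(predict(fit, newdata = predictData)) + sample(res, 1L)
})

predInterval <- quantile(sims, c(0.025, 0.975))
predInterval
```

Because in-sample residuals of a deep tree tend to be optimistic, the resulting interval is likely to be somewhat too narrow; residuals from held-out or out-of-bag rows would be a more cautious choice.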

Source: https://habr.com/ru/post/983978/