I want to programmatically test one rule generated from a tree. In trees, the path between the root and the leaf (terminal node) can be interpreted as a rule.
In R, we can use the rpart package and do the following (in this post I use the iris dataset, for example purposes only):
library(rpart)
model <- rpart(Species ~ ., data = iris)
With these two lines, I get a tree named model, an object of class rpart (described under rpart.object in the rpart documentation, p. 21). This object holds a lot of information and supports many methods. In particular, it has a frame component (accessed in the standard way, model$frame) (idem) and the path.rpart method (rpart documentation, p. 7), which gives the path from the root of the tree to the node of interest (the nodes argument of the function).
The row.names of frame contain the node numbers of the tree. The var column gives the split variable at each node, yval the fitted value, and yval2 the class counts and probabilities, among other information.
> model$frame
           var   n  wt dev yval complexity ncompete nsurrogate
1 Petal.Length 150 150 100    1       0.50        3          3
2       <leaf>  50  50   0    1       0.01        0          0
3  Petal.Width 100 100  50    2       0.44        3          3
6       <leaf>  54  54   5    2       0.00        0          0
7       <leaf>  46  46   1    3       0.01        0          0
     yval2.1     yval2.2     yval2.3     yval2.4    yval2.5    yval2.6    yval2.7
1 1.00000000 50.00000000 50.00000000 50.00000000 0.33333333 0.33333333 0.33333333
2 1.00000000 50.00000000  0.00000000  0.00000000 1.00000000 0.00000000 0.00000000
3 2.00000000  0.00000000 50.00000000 50.00000000 0.00000000 0.50000000 0.50000000
6 2.00000000  0.00000000 49.00000000  5.00000000 0.00000000 0.90740741 0.09259259
7 3.00000000  0.00000000  1.00000000 45.00000000 0.00000000 0.02173913 0.97826087
Only the rows marked as <leaf> in the var column are terminal nodes (leaves). In this case, nodes 2, 6, and 7.
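The leaf node numbers can also be pulled out of the frame instead of being read off by eye. A minimal sketch, assuming the same model as above; the names is_leaf and leaf_nodes are only illustrative:

```r
library(rpart)

# Refit the same tree so this block runs on its own.
model <- rpart(Species ~ ., data = iris)

# Rows whose split variable is "<leaf>" are terminal nodes;
# their row names are the node numbers.
is_leaf <- model$frame$var == "<leaf>"
leaf_nodes <- as.integer(row.names(model$frame)[is_leaf])
leaf_nodes
```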
As mentioned above, the path.rpart method can be used to extract the rule (this method is used in the rattle package and in the Sharma Credit Score article).
In addition, the model stores the levels of the predicted variable in an attribute:
predicted.levels <- attr(model, "ylevels")
These levels are indexed by the yval column of the model$frame data frame.
For the leaf with node number 7 (row 5 of model$frame), the predicted value is
> predicted.levels[model$frame[5, ]$yval]
[1] "virginica"
and rule
> rule <- path.rpart(model, nodes = 7)

 node number: 7
   root
   Petal.Length>=2.45
   Petal.Width>=1.75
So the rule can be read as
If Petal.Length >= 2.45 AND Petal.Width >= 1.75 THEN Species = Virginica
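The predicted class above was looked up by row position (5), which only works if you already know where node 7 sits in the frame. A small sketch of the same lookup by node number, assuming the model above; the variables node and row are just illustrative names:

```r
library(rpart)

# Refit the same tree so this block runs on its own.
model <- rpart(Species ~ ., data = iris)
predicted.levels <- attr(model, "ylevels")

# Find the frame row whose row name equals the node number,
# then map its yval (an index) to the class label.
node <- 7
row <- match(as.character(node), row.names(model$frame))
predicted.class <- predicted.levels[model$frame$yval[row]]
predicted.class
```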
I know that I can check how many true positives this rule produces (as the test dataset I use the iris dataset again) by subsetting the new dataset as follows
> hits <- subset(iris, Petal.Length >= 2.45 & Petal.Width >= 1.75)
and then computing the confusion matrix
> table(hits$Species, hits$Species == "virginica")

             FALSE TRUE
  setosa         0    0
  versicolor     1    0
  virginica      0   45
(Note: I used the same iris dataset as the test set)
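The counts in that table can be condensed into two numbers for the rule. A rough sketch, assuming the hand-written subset above; rule_precision and rule_recall are illustrative names, and since the test data is the training data these figures are optimistic:

```r
# Rows of iris matched by the rule for node 7 (written by hand here).
hits <- subset(iris, Petal.Length >= 2.45 & Petal.Width >= 1.75)

# Share of matched rows that really are virginica.
rule_precision <- mean(hits$Species == "virginica")

# Share of all virginica rows that the rule captures.
rule_recall <- sum(hits$Species == "virginica") / sum(iris$Species == "virginica")

c(precision = rule_precision, recall = rule_recall)
```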
How can I evaluate the rule programmatically? I could extract the conditions from the rule as follows
> unlist(rule, use.names = FALSE)[-1]
[1] "Petal.Length>=2.45" "Petal.Width>=1.75"
But how can I continue from here? I cannot pass these character strings directly to the subset function.
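One direction that might work is to paste the condition strings together and turn them into an R expression with parse, then evaluate it against the data frame. This is only a sketch under assumptions: it relies on every condition being a valid R comparison, which holds for the numeric splits here; categorical splits are printed by path.rpart in a different form and would likely need extra handling. The names conditions and rule_expr are illustrative.

```r
library(rpart)

# Refit the same tree so this block runs on its own.
model <- rpart(Species ~ ., data = iris)

# print.it = FALSE suppresses the console echo of the path.
rule <- path.rpart(model, nodes = 7, print.it = FALSE)
conditions <- unlist(rule, use.names = FALSE)[-1]  # drop "root"

# Join the conditions with "&" and parse them into a single expression.
rule_expr <- parse(text = paste(conditions, collapse = " & "))[[1]]

# Evaluate with the columns of the data frame visible as variables.
hits <- iris[eval(rule_expr, envir = iris), ]
table(hits$Species)
```

Because the expression comes from text, this should only be used on rules produced by your own model, not on untrusted strings.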
Thanks in advance
NOTE. This question has been heavily edited for clarity.