Getting observations in rpart node (i.e.: CART)

I would like to check all the observations that reached some node in the rpart decision tree. For example, in the following code:

    fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)
    fit

    n= 81

    node), split, n, loss, yval, (yprob)
          * denotes terminal node

     1) root 81 17 absent (0.79012346 0.20987654)
       2) Start>=8.5 62  6 absent (0.90322581 0.09677419)
         4) Start>=14.5 29  0 absent (1.00000000 0.00000000) *
         5) Start< 14.5 33  6 absent (0.81818182 0.18181818)
          10) Age< 55 12  0 absent (1.00000000 0.00000000) *
          11) Age>=55 21  6 absent (0.71428571 0.28571429)
            22) Age>=111 14  2 absent (0.85714286 0.14285714) *
            23) Age< 111 7  3 present (0.42857143 0.57142857) *
       3) Start< 8.5 19  8 present (0.42105263 0.57894737) *

I would like to see all the observations in node (5) (i.e. the 33 cases for which Start >= 8.5 and Start < 14.5). Obviously, I could get to them manually. But I would like to have some function like (say) "get_node_date", so that I could just run get_node_date(5) and get the corresponding observations.

Any suggestions on how to do this?

4 answers

There does not seem to be a function that extracts the observations from a specific node. I would solve it as follows: first determine which rule(s) lead to the node you are interested in. You can use path.rpart for that. Then apply the rule(s) one by one to extract the observations.

This approach as a function:

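    get_node_date <- function(tree = fit, node = 5){
      # path.rpart() returns (and prints) the split rules from the root to the node
      rule <- path.rpart(tree, node)
      # drop "root", then split each rule into variable, operator and threshold
      rule_2 <- sapply(rule[[1]][-1], function(x) strsplit(x, '(?<=[><=])(?=[^><=])|(?<=[^><=])(?=[><=])', perl = TRUE))
      # evaluate every rule on the kyphosis data and keep rows that satisfy all of them
      ind <- apply(do.call(cbind, lapply(rule_2, function(x) eval(call(x[2], kyphosis[,x[1]], as.numeric(x[3]))))), 1, all)
      kyphosis[ind,]
    }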

For node 5 you get:

    get_node_date()

     node number: 5
       root
       Start>=8.5
       Start< 14.5
       Kyphosis Age Number Start
    2    absent 158      3    14
    10  present  59      6    12
    11  present  82      5    14
    14   absent   1      4    12
    18   absent 175      5    13
    20   absent  27      4     9
    23  present  96      3    12
    26   absent   9      5    13
    28   absent 100      3    14
    32   absent 125      2    11
    33   absent 130      5    13
    35   absent 140      5    11
    37   absent   1      3     9
    39   absent  20      6     9
    40  present  91      5    12
    42   absent  35      3    13
    46  present 139      3    10
    48   absent 131      5    13
    50   absent 177      2    14
    51   absent  68      5    10
    57   absent   2      3    13
    59   absent  51      7     9
    60   absent 102      3    13
    66   absent  17      4    10
    68   absent 159      4    13
    69   absent  18      4    11
    71   absent 158      5    14
    72   absent 127      4    12
    74   absent 206      4    10
    77  present 157      3    13
    78   absent  26      7    13
    79   absent 120      2    13
    81   absent  36      4    13

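rpart returns an object of class rpart (documented under rpart.object) that contains the necessary information. Its frame component has one row per node, including the number of observations n in that node: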

    require(rpart)
    fit2 <- rpart(Kyphosis ~ Age + Start, data = kyphosis)
    fit2

    get_node_date <- function(nodeId, fit) {
      fit$frame[toString(nodeId), "n"]
    }

    for (i in c(1, 2, 4, 5, 10, 11, 22, 23, 3))
      cat(get_node_date(i, fit2), "\n")
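The frame lookup above yields only the count for each node. As a minimal sketch of how the same object could give the rows themselves, the where component (the frame row of the leaf each observation falls into) can be combined with rpart's node numbering, in which the children of node k are 2k and 2k + 1. The helpers is_in_subtree and rows_in_node are hypothetical names, not part of rpart, and the sketch assumes no rows were dropped for missing values, so that where lines up with the rows of the data:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)

# Hypothetical helper (not part of rpart): TRUE if node number `id` equals
# `node` or lies below it. Relies on children of node k being 2k and 2k + 1.
is_in_subtree <- function(id, node) {
  while (id > node) id <- id %/% 2
  id == node
}

# Hypothetical helper: rows of `data` whose leaf lies in the subtree rooted
# at `node`. Assumes fit$where is aligned with the rows of `data`.
rows_in_node <- function(fit, data, node) {
  frame_ids <- as.integer(rownames(fit$frame))
  leaf_ids <- frame_ids[fit$where]  # leaf node number of each observation
  keep <- vapply(leaf_ids, is_in_subtree, logical(1), node = node)
  data[keep, ]
}

node5 <- rows_in_node(fit, kyphosis, 5)
nrow(node5)
```

The row count returned for a node should agree with the n column of fit$frame for that node, which is a quick way to check the result.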

partykit also provides a ready-made solution for this. You just need to convert the rpart object to the party class to use its unified interface for working with trees. And then you can use the data_party() function.

Using fit from the question and after loading library("partykit"), you can first coerce the rpart tree to party:

    pfit <- as.party(fit)
    plot(pfit)

[Figure: plot of the full pfit tree]

There are only two small nuisances in extracting the data the way you want: (1) The model.frame() from the original fit is always dropped during coercion and has to be reattached manually. (2) A different numbering scheme is used for the nodes. You now want node 4 (not 5).

    pfit$data <- model.frame(fit)
    data4 <- data_party(pfit, 4)
    dim(data4)
    ## [1] 33  5
    head(data4)
    ##    Kyphosis Age Start (fitted) (response)
    ## 2    absent 158    14        7     absent
    ## 10  present  59    12        8    present
    ## 11  present  82    14        8    present
    ## 14   absent   1    12        5     absent
    ## 18   absent 175    13        7     absent
    ## 20   absent  27     9        5     absent

Another route is to subset the tree to the subtree starting at node 4 and then take the data from that:

    pfit4 <- pfit[4]
    plot(pfit4)

[Figure: plot of the pfit subtree starting at node 4]

Then data_party(pfit4) gives you the same thing as data4 above, and pfit4$data gives you the data without the (fitted) node and the predicted (response) columns.
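To avoid reading the renumbered node id off the plot, a rough sketch of a lookup follows. It rests on an assumption that is not stated in the partykit documentation quoted here: that as.party() numbers the nodes in the same order as the rows of fit$frame, so the position of the rpart node number in rownames(fit$frame) is the party id. That is consistent with node 5 becoming node 4 in this tree, but it is worth checking against plot(pfit) for other trees. The name party_id is illustrative only:

```r
library(rpart)
library(partykit)

fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)
pfit <- as.party(fit)
pfit$data <- model.frame(fit)  # reattach the data dropped during coercion

# Assumption: party ids follow the row order of fit$frame, so the position
# of rpart node "5" among the frame's row names is its party id.
party_id <- which(rownames(fit$frame) == "5")

data_node <- data_party(pfit, party_id)
dim(data_node)
```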


Another way is to find all the terminal nodes below a given node and return the corresponding subset of the data used in the call.

    # path of node numbers from the root (1) down to node x
    parent <- function(x) {
      if (x[1] != 1)
        c(Recall(if (x %% 2 == 0L) x / 2 else (x - 1) / 2), x) else x
    }

    subset.rpart <- function(tree, node = 1L) {
      # data as referenced in the original rpart() call
      data <- eval(tree$call$data, parent.frame(1L))
      wh <- sapply(as.integer(rownames(tree$frame)), parent)
      # nodes on any root-to-node path that passes through `node`
      wh <- unique(unlist(wh[sapply(wh, function(x) node %in% x)]))
      # keep rows whose terminal node is `node` or one of its descendants
      data[rownames(tree$frame)[tree$where] %in% wh[wh >= node], ]
    }

    fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)
    head(subset.rpart(fit, 5))

    #    Kyphosis Age Number Start
    # 2    absent 158      3    14
    # 10  present  59      6    12
    # 11  present  82      5    14
    # 14   absent   1      4    12
    # 18   absent 175      5    13
    # 20   absent  27      4     9

Source: https://habr.com/ru/post/1247538/

