Graphviz Interpretation for Decision Tree Regression

I'm curious about the value field in the nodes of the decision tree created by Graphviz when used for regression. I understand that for classification this is the number of samples of each class partitioned by the decision tree, but I'm not sure what it means for regression.

My data has 2-dimensional input and 10-dimensional output. Here is an example of what the tree looks like for my regression problem:

[image: the exported decision tree]

created using the following code and rendered with WebGraphviz:

    import pickle
    from sklearn import tree
    from sklearn.tree import DecisionTreeRegressor

    # X: (n x 2), Y: (n x 10), X_test: (m x 2)
    input_scaler = pickle.load(open("../input_scaler.sav", "rb"))

    reg = DecisionTreeRegressor(criterion='mse', max_depth=2)
    reg.fit(X, Y)
    pred = reg.predict(X_test)

    with open("classifier.txt", "w") as f:
        tree.export_graphviz(reg, out_file=f)

thanks!

1 answer

What the regression tree actually returns as output is the mean value of the dependent variable (here Y) over the training samples that end up in the respective terminal nodes (leaves); these mean values are shown as the lists named value in the picture, each of length 10 here, since your Y is 10-dimensional.

In other words, taking the leftmost terminal node (leaf) of your tree as an example:

  • The leaf contains the 42 samples for which X[0] <= 0.675 and X[1] <= 0.5
  • The mean of your 10-dimensional output over these 42 samples is given in the value list of this leaf, which indeed has length 10: the mean of Y[0] is -152007.382, the mean of Y[1] is -206040.675, and so on, up to the mean of Y[9], which is 3211.487.

You can confirm that this is the case by predicting a few samples (from your training or your test set, it doesn't matter) and verifying that each 10-dimensional result is one of the 4 value lists shown in the terminal leaves above.
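Here is a minimal sketch of that check, assuming the reg and X from the code above; the per-node values are read from the fitted tree_ attribute, whose value array has shape (n_nodes, n_outputs, 1) for multi-output regression:

    import numpy as np

    t = reg.tree_
    # leaves are the nodes without children (children_left == -1)
    leaf_values = t.value[t.children_left == -1].squeeze(axis=2)  # (n_leaves, 10)

    # every prediction coincides with exactly one leaf's value list
    for row in reg.predict(X[:5]):
        assert any(np.allclose(row, leaf) for leaf in leaf_values)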

Additionally, you can confirm that, for each element of value, the sample-weighted average over the child nodes equals the corresponding element of the parent node. Again, using the first value element of your 2 leftmost terminal nodes (leaves), we get:

    (42*(-152007.382) + 56*(-199028.147))/98  # -178876.39057142858

i.e., the value[0] element of their parent node (the leftmost node of the intermediate level). Another example, this time for the first value elements of your 2 intermediate nodes:

    (98*(-178876.391) + 42*417378.245)/140  # -0.00020000000617333822

which again matches the -0.0 first value element of your root node.
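The same consistency check can be run programmatically over the whole tree; a short sketch, again assuming the fitted reg from above:

    import numpy as np

    t = reg.tree_
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf, nothing to check
            continue
        n_l, n_r = t.n_node_samples[left], t.n_node_samples[right]
        blended = (n_l * t.value[left] + n_r * t.value[right]) / (n_l + n_r)
        # parent value == sample-weighted average of the children's values
        assert np.allclose(blended, t.value[node])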

Judging from the value list of your root node, it seems that the mean values of all the elements of your 10-dimensional Y are almost zero, which you can (and should) verify manually as a final confirmation.
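That manual check is just the column-wise mean of the training targets; assuming the Y from the question:

    # both should print (almost) the same 10 numbers, all ~0 here
    print(Y.mean(axis=0))              # column-wise mean of the targets
    print(reg.tree_.value[0].ravel())  # root node's value list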


So, to summarize:

  • The value list of each node contains the mean Y values of the training samples belonging to the corresponding node
  • Additionally, for the terminal nodes (leaves), these lists are the actual outputs of the tree model (i.e., the prediction will always be one of these lists, depending on X)
  • For the root node, the value list contains the mean Y values for your entire training set.
