Graphviz Interpretation for Decision Tree Regression

I'm curious about the value field in the nodes of the decision tree created by Graphviz when used for regression. I understand that for classification this is the number of samples of each class partitioned by the decision tree, but I'm not sure what it means for regression.

My data has 2-dimensional input and 10-dimensional output. Here is an example of what the tree looks like for my regression problem:

[image: the exported decision tree]

created using the following code and rendered with WebGraphviz:

    import pickle
    from sklearn import tree
    from sklearn.tree import DecisionTreeRegressor

    # X: (n x 2), Y: (n x 10), X_test: (m x 2)
    input_scaler = pickle.load(open("../input_scaler.sav", "rb"))

    reg = DecisionTreeRegressor(criterion='mse', max_depth=2)
    reg.fit(X, Y)
    pred = reg.predict(X_test)

    with open("classifier.txt", "w") as f:
        tree.export_graphviz(reg, out_file=f)

thanks!

1 answer

What the regression tree actually returns as output is the mean value of the dependent variable (here Y) over the training samples that end up in the respective terminal nodes (leaves); these mean values are shown as the lists named value in the picture, each of length 10 here, since your Y is 10-dimensional.

In other words, taking the leftmost terminal node (leaf) of your tree as an example:

  • The leaf contains the 42 samples for which X[0] <= 0.675 and X[1] <= 0.5
  • The mean of your 10-dimensional output over these 42 samples is given in the value list of this leaf, which indeed has length 10: the mean of Y[0] is -152007.382, the mean of Y[1] is -206040.675, and so on, up to the mean of Y[9], which is 3211.487.

You can confirm that this is the case by predicting a few samples (from your training or your test set, it doesn't matter) and verifying that each 10-dimensional result is one of the 4 value lists shown in the terminal leaves above.
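Here is a minimal sketch of that check, assuming the reg and X from the code above; the per-node values are read from the fitted tree_ attribute, whose value array has shape (n_nodes, n_outputs, 1) for multi-output regression:

    import numpy as np

    t = reg.tree_
    # leaves are the nodes without children (children_left == -1)
    leaf_values = t.value[t.children_left == -1].squeeze(axis=2)  # (n_leaves, 10)

    # every prediction coincides with exactly one leaf's value list
    for row in reg.predict(X[:5]):
        assert any(np.allclose(row, leaf) for leaf in leaf_values)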

Additionally, you can confirm that, for each element of value, the sample-weighted average over the child nodes equals the corresponding element of the parent node. Again, using the first value element of your 2 leftmost terminal nodes (leaves), we get:

    (42*(-152007.382) + 56*(-199028.147))/98  # -178876.39057142858

i.e., the value[0] element of their parent node (the leftmost node of the intermediate level). Another example, this time for the first value elements of your 2 intermediate nodes:

    (98*(-178876.391) + 42*417378.245)/140  # -0.00020000000617333822

which again matches the -0.0 first value element of your root node.
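The same consistency check can be run programmatically over the whole tree; a short sketch, again assuming the fitted reg from above:

    import numpy as np

    t = reg.tree_
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf, nothing to check
            continue
        n_l, n_r = t.n_node_samples[left], t.n_node_samples[right]
        blended = (n_l * t.value[left] + n_r * t.value[right]) / (n_l + n_r)
        # parent value == sample-weighted average of the children's values
        assert np.allclose(blended, t.value[node])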

Judging from the value list of your root node, it seems that the mean values of all the elements of your 10-dimensional Y are almost zero, which you can (and should) verify manually as a final confirmation.
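That manual check is just the column-wise mean of the training targets; assuming the Y from the question:

    # both should print (almost) the same 10 numbers, all ~0 here
    print(Y.mean(axis=0))              # column-wise mean of the targets
    print(reg.tree_.value[0].ravel())  # root node's value list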


So, to summarize:

  • The value list of each node contains the mean Y values of the training samples belonging to the corresponding node
  • Additionally, for the terminal nodes (leaves), these lists are the actual outputs of the tree model (i.e., the prediction will always be one of these lists, depending on X)
  • For the root node, the value list contains the mean Y values for your entire training set.
