How to find key trees / objects from a trained random forest?

I use the scikit-learn RandomForestClassifier and am trying to extract meaningful trees / features to better understand the prediction results.

I found this method in the documentation that seems relevant ( http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.get_params ) but could not find an example of how to use it.

I would also like to visualize these trees, if possible; any relevant code would be great.

Thanks!

+4
3 answers

For feature importances, read the relevant section of the documentation, along with the code of the examples linked in that same section.

The trees themselves are stored in the estimators_ attribute of the random forest instance (available only after calling the fit method). To extract a "key tree", you would first need to define what that means and what you expect to learn from it.
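As a minimal, self-contained sketch (not from the original answer) of how to get at those trees and visualize one of them, the snippet below fits a small forest on the iris dataset purely as a stand-in and uses scikit-learn's standard tree export helpers on a single estimator:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import export_graphviz, export_text

    # Fit a small forest on the iris data just to have something to inspect.
    iris = load_iris()
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    clf.fit(iris.data, iris.target)

    # estimators_ is a plain list of fitted DecisionTreeClassifier objects.
    first_tree = clf.estimators_[0]

    # Print the decision rules of one tree as text (needs scikit-learn >= 0.21).
    print(export_text(first_tree, feature_names=list(iris.feature_names)))

    # Or write a Graphviz .dot file and render it with: dot -Tpng tree0.dot -o tree0.png
    export_graphviz(first_tree, out_file='tree0.dot',
                    feature_names=iris.feature_names, filled=True)

The same pattern works for any tree in the list; each one is just a regular decision tree.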

You could rank individual trees by computing their score on a held-out test set, but I don't know what you would expect to get out of that.
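For what it's worth, here is a minimal sketch of that ranking (not from the original answer). It assumes clf is a fitted RandomForestClassifier, X_test / y_test are a held-out split, and the class labels are already encoded as 0..n_classes-1 (as in the 0/1 example further down), so that each tree's score method compares like with like:

    import numpy as np

    # Accuracy of each individual tree on the held-out set.
    tree_scores = np.array([tree.score(X_test, y_test) for tree in clf.estimators_])

    # Indices of the trees, best first.
    ranking = np.argsort(tree_scores)[::-1]
    for rank, idx in enumerate(ranking[:5], start=1):
        print(f"#{rank}: tree {idx}, accuracy {tree_scores[idx]:.3f}")

Keeping only the top-ranked trees would be one crude way to shrink the forest, which connects to the pruning question below.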

Do you want to prune the forest to speed up prediction by reducing the number of trees without reducing the overall accuracy of the forest?

+6

I think you are looking for Forest.feature_importances_. This lets you see the relative importance of each input feature to your final model. Here is a simple example.

    import random
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Let's set up a training dataset. We'll make 100 entries, each with 19 features,
    # and each row classified as either 0 or 1. We'll control the first 3 features,
    # artificially setting them to fixed values for rows classified as "1", so that we
    # know these are the "important" features. If we do it right, the model should point
    # out these three as important. The rest of the features will just be noise.
    train_data = []  # must be all floats
    for x in range(100):
        line = []
        if random.random() > 0.5:
            line.append(1.0)
            # Add 3 features that we know indicate a row classified as "1".
            line.append(.77)
            line.append(.33)
            line.append(.55)
            for x in range(16):  # fill in the rest with noise
                line.append(random.random())
        else:
            # This is a "0" row, so fill it with noise.
            line.append(0.0)
            for x in range(19):
                line.append(random.random())
        train_data.append(line)
    train_data = np.array(train_data)

    # Create the random forest object which will include all the parameters for the fit.
    Forest = RandomForestClassifier(n_estimators=100)

    # Fit the training data to the training output and create the decision trees.
    # The first column in our data is the classification; the rest of the columns
    # are the features.
    Forest = Forest.fit(train_data[:, 1:], train_data[:, 0])

    # Now you can see the importance of each feature in Forest.feature_importances_.
    # These values all add up to one. Let's call the "important" ones those that are
    # above average.
    important_features = []
    for x, i in enumerate(Forest.feature_importances_):
        if i > np.average(Forest.feature_importances_):
            important_features.append(str(x))
    print('Most important features:', ', '.join(important_features))
    # We see that the model correctly detected that the first three features are the
    # most important, just as we expected!
+16

This is how I visualize the feature importances:

First, build the model after you have done all the preprocessing, train/test splitting, etc.:

    # 100 trees in the forest
    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
    classifier.fit(X_train, y_train)

Make predictions:

    # Predicting the test set results
    y_pred = classifier.predict(X_test)

Then plot the feature importances. The dataset variable is the name of the original DataFrame.

    import numpy as np
    import matplotlib.pyplot as plt

    # Get the importances from the fitted random forest
    importances = classifier.feature_importances_
    # Sort them (argsort is ascending, so the most important features end up at the top of the barh plot)
    indices = np.argsort(importances)
    # Get the feature names from the original data set
    features = dataset.columns[0:26]
    # Plot them with a horizontal bar chart
    plt.figure(1)
    plt.title('Feature Importances')
    plt.barh(range(len(indices)), importances[indices], color='b', align='center')
    plt.yticks(range(len(indices)), features[indices])
    plt.xlabel('Relative Importance')
    plt.show()

This gives a graph as shown below:

[Image: horizontal bar chart of relative feature importances]

0
