Cross Validation + Decision Trees in Sklearn

Trying to create a cross-validated decision tree using sklearn and pandas.

My question is about the code below. Cross validation splits the data, which I then use for training and testing. I try to find the best depth of the tree by recreating it n times with different maximum depths. Should I be using k-fold CV for this, and if so, how do I use it in the code I have?

    import numpy as np
    import pandas as pd
    from sklearn import tree
    from sklearn import cross_validation

    features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
                "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
    df = pd.read_csv('magic04.data', header=None, names=features)
    # Encode the class labels: gamma -> 0, hadron -> 1
    df['class'] = df['class'].map({'g': 0, 'h': 1})

    x = df[features[:-1]]
    y = df['class']

    # Single static split: 60% train, 40% test
    x_train, x_test, y_train, y_test = cross_validation.train_test_split(x, y, test_size=0.4, random_state=0)

    depth = []
    for i in range(3, 20):
        clf = tree.DecisionTreeClassifier(max_depth=i)
        clf = clf.fit(x_train, y_train)
        depth.append((i, clf.score(x_test, y_test)))
    print depth

Here is a link to the data I use, in case it helps: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope

1 answer

In your code, you create a single static train/test split. If you want to select the best depth by cross-validation, you can use sklearn.cross_validation.cross_val_score inside the for loop.

You can read the sklearn documentation for more information.

Here is your code updated to use CV:

    import numpy as np
    import pandas as pd
    from sklearn import tree
    from sklearn.cross_validation import cross_val_score
    from pprint import pprint

    features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
                "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
    df = pd.read_csv('magic04.data', header=None, names=features)
    df['class'] = df['class'].map({'g': 0, 'h': 1})

    x = df[features[:-1]]
    y = df['class']

    # x_train, x_test, y_train, y_test = cross_validation.train_test_split(x, y, test_size=0.4, random_state=0)
    depth = []
    for i in range(3, 20):
        clf = tree.DecisionTreeClassifier(max_depth=i)
        # Perform 7-fold cross validation
        scores = cross_val_score(estimator=clf, X=x, y=y, cv=7, n_jobs=4)
        depth.append((i, scores.mean()))
    print(depth)
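If you then want to pick the winning depth programmatically rather than by reading the printed list, here is a minimal sketch (assuming the depth list built above):

    # Select the (depth, mean score) pair with the highest mean CV score
    best_depth, best_score = max(depth, key=lambda pair: pair[1])
    print(best_depth, best_score)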

Alternatively, you can use sklearn.model_selection.GridSearchCV instead of writing the for loop yourself, especially if you want to optimize more than one hyperparameter.

    import numpy as np
    import pandas as pd
    from sklearn import tree
    from sklearn.model_selection import GridSearchCV

    features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
                "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
    df = pd.read_csv('magic04.data', header=None, names=features)
    df['class'] = df['class'].map({'g': 0, 'h': 1})

    x = df[features[:-1]]
    y = df['class']

    parameters = {'max_depth': range(3, 20)}
    clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)
    clf.fit(X=x, y=y)
    tree_model = clf.best_estimator_
    print(clf.best_score_, clf.best_params_)
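Note that GridSearchCV refits the best estimator on the full dataset by default (refit=True), so tree_model can be used for predictions right away; a minimal sketch, reusing x from above:

    # The best estimator is already refit on all the data (refit=True by default),
    # so it can predict directly; shown here on the first five rows as a quick check
    print(tree_model.predict(x.head()))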

Edit: Updated how GridSearchCV is imported, following learn2day's comment (it moved to sklearn.model_selection).
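If the script has to run against both older and newer sklearn versions, a guarded import is one option; a minimal sketch:

    # Try the new location first (sklearn >= 0.18), fall back to the old module
    try:
        from sklearn.model_selection import GridSearchCV
    except ImportError:
        from sklearn.grid_search import GridSearchCV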

