Random Forest Overfitting

I use scikit-learn with stratified cross-validation to compare several classifiers. I compute accuracy, sensitivity, specificity, balanced accuracy, and AUC.

I optimized the hyperparameters with GridSearchCV using 5-fold CV.

    RandomForestClassifier(warm_start=True, min_samples_leaf=1, n_estimators=800,
                           min_samples_split=5, max_features='log2', max_depth=400,
                           class_weight=None)

are the best parameters found by GridSearchCV.
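
Roughly, the search looked like this; note that the grid below is only an illustrative sketch, not my exact grid, and x/y stand for my feature matrix and label vector:

    from sklearn.grid_search import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    # Illustrative parameter grid only -- the grid I actually searched was different/larger.
    param_grid = {
        'n_estimators': [200, 400, 800],
        'max_features': ['sqrt', 'log2'],
        'max_depth': [100, 400, None],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 3],
    }

    gs = GridSearchCV(RandomForestClassifier(warm_start=True, class_weight=None),
                      param_grid=param_grid,
                      scoring='accuracy',
                      cv=5)          # 5-fold CV for the parameter search
    gs.fit(x, y)
    print(gs.best_params_)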

My problem is that I think I am really overfitting. For instance:

Random forest (mean +/- standard deviation):

  • accuracy: 0.99 (+/- 0.06)
  • sensitivity: 0.94 (+/- 0.06)
  • specificity: 0.94 (+/- 0.06)
  • B_accuracy: 0.94 (+/- 0.06)
  • AUC: 0.94 (+/- 0.11)

Logistic regression (mean +/- standard deviation):

  • accuracy: 0.88 (+/- 0.06)
  • sensitivity: 0.79 (+/- 0.06)
  • specificity: 0.68 (+/- 0.06)
  • B_accuracy: 0.73 (+/- 0.06)
  • AUC: 0.73 (+/- 0.041)

The other classifiers also give results similar to logistic regression (so they don't look overfitted).

My code for CV:

    import math
    import numpy as np

    X, y = [], []
    for row in data:                     # data: list of (features, label) pairs
        X.append(row[0])
        y.append(float(row[1]))
    x = np.array(X)
    y = np.array(y)

    def SD(values):
        """Return (standard deviation, mean) of a list of scores."""
        mean = sum(values) / len(values)
        squared_diffs = [(v - mean) ** 2 for v in values]
        return math.sqrt(sum(squared_diffs) / len(values)), mean

    for name, clf in zip(titles, classifiers):   # go through all classifiers, compute 10 folds
        pre, sen, spe, ba, area = [], [], [], [], []
        for train_index, test_index in skf:
            # convert the index arrays to lists (worked around some errors that way)
            train = train_index.tolist()
            test = test_index.tolist()
            X_train = [x[i] for i in train]
            X_test = [x[i] for i in test]
            y_train = [y[i] for i in train]
            y_test = [y[i] for i in test]
            # clf = clf.fit(X_train, y_train)
            # predicted = clf.predict_proba(X_test)
            # ... other code, calculating the metrics and appending them to
            #     pre, sen, spe, ba and area ...
        print name
        print("precision:   %0.2f \t(+/- %0.2f)" % (SD(pre)[1], SD(pre)[0]))
        print("sensitivity: %0.2f \t(+/- %0.2f)" % (SD(sen)[1], SD(sen)[0]))
        print("specificity: %0.2f \t(+/- %0.2f)" % (SD(spe)[1], SD(spe)[0]))
        print("B_accuracy:  %0.2f \t(+/- %0.2f)" % (SD(ba)[1], SD(ba)[0]))
        print("AUC:         %0.2f \t(+/- %0.2f)" % (SD(area)[1], SD(area)[0]))
        print("\n")
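
(skf is defined earlier in my script; it is essentially a stratified 10-fold split along these lines, using the old sklearn.cross_validation API:)

    from sklearn import cross_validation

    # iterating over skf yields one (train_index, test_index) pair per fold
    skf = cross_validation.StratifiedKFold(y, n_folds=10)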

If I use scores = cross_validation.cross_val_score(clf, X, y, cv=10, scoring='accuracy'), I do not get these overfitted-looking values. So maybe something is wrong with the CV method that I use? But this only happens for the random forest ...

I wrote my own CV loop because specificity is not available as a scoring option in cross_val_score.
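
(A custom scorer would probably have worked as well; here is a rough sketch that treats specificity as the recall of the negative class, assuming my labels are 0/1 with 0 as the negative class:)

    from sklearn import cross_validation
    from sklearn.metrics import make_scorer, recall_score

    # specificity = recall of the negative class (label 0 assumed to be "negative")
    specificity_scorer = make_scorer(recall_score, pos_label=0)

    spec_scores = cross_validation.cross_val_score(clf, x, y, cv=10,
                                                   scoring=specificity_scorer)
    print("specificity: %0.2f (+/- %0.2f)" % (spec_scores.mean(), spec_scores.std()))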

1 answer

Herbert,

if your goal is to compare different learning algorithms, I recommend using nested cross-validation. (By "learning algorithm" I mean the various algorithms, such as logistic regression, decision trees, and other discriminative models, that learn a hypothesis or model, i.e., the final classifier, from your training data.)

"Regular" cross-validation is great if you like to configure hyperparameters of one algorithm. However, as soon as you start optimizing hyperparameters with the same cross-validation parameters / folds, your performance assessment is likely to be excessive. The reason that you use cross-validation over and over again is that your test data will become, to some extent, “learning data”.

People have asked me this question quite often, actually, so I will take a few excerpts from an FAQ section I posted here: http://sebastianraschka.com/faq/docs/evaluate-a-model.html

In nested cross-validation, we have an outer k-fold cross-validation loop that splits the data into training and test folds, and an inner loop that selects the model via k-fold cross-validation on the training fold. After model selection, the test fold is then used to evaluate the model's performance. After we have identified our "favorite" algorithm, we can follow up with a "regular" k-fold cross-validation approach (on the complete training set) to find its "optimal" hyperparameters and evaluate it on an independent test set.

Let's consider a logistic regression model to make this clearer: using nested cross-validation, you would train m different logistic regression models, one for each of the m outer folds, and the inner folds are used to optimize the hyperparameters of each model (e.g., using grid search in combination with k-fold cross-validation). If your model is stable, these m models should all end up with the same hyperparameter values, and you report the average performance of this model based on the outer test folds. Then you proceed to the next algorithm, e.g., an SVM, and so on.

[Figure: illustration of the nested cross-validation procedure]
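
Spelled out as an explicit loop for one algorithm, the procedure looks roughly like this (only a sketch: the grid and fold counts are arbitrary, and X and y are assumed to be NumPy arrays):

    import numpy as np
    from sklearn.cross_validation import StratifiedKFold
    from sklearn.grid_search import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    outer_scores = []
    outer_cv = StratifiedKFold(y, n_folds=5)        # outer loop: performance estimation

    for train_idx, test_idx in outer_cv:
        X_tr, X_te = X[train_idx], X[test_idx]
        y_tr, y_te = y[train_idx], y[test_idx]

        # inner loop: hyperparameter selection on the training fold only
        gs = GridSearchCV(LogisticRegression(),
                          param_grid={'C': [0.01, 0.1, 1.0, 10.0]},
                          scoring='accuracy',
                          cv=2)
        gs.fit(X_tr, y_tr)

        # evaluate the tuned model on the untouched outer test fold
        outer_scores.append(gs.best_estimator_.score(X_te, y_te))

    print('Nested CV accuracy: %.3f +/- %.3f'
          % (np.mean(outer_scores), np.std(outer_scores)))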

I can also highly recommend this excellent article, which discusses the issue in more detail:

PS: Typically, you don't need/want to tune the hyperparameters of a random forest that extensively. The idea behind random forests (a form of bagging) is actually to not prune the decision trees; in fact, one reason Breiman came up with the random forest algorithm was to deal with the pruning/overfitting problems of individual decision trees. So the only parameter you really have to "worry" about is the number of trees (and maybe the number of random features per tree). However, you are typically best off using bootstrap samples of size n (where n is the original number of samples in the training set) and sqrt(m) features (where m is the dimensionality of your training set).
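
In code, that essentially boils down to something like the following (just a sketch; the n_estimators value is arbitrary and X_train/y_train are placeholders):

    from sklearn.ensemble import RandomForestClassifier

    # unpruned trees (the default), bootstrap samples of size n (the default),
    # and sqrt(m) features considered at each split
    forest = RandomForestClassifier(n_estimators=800,
                                    max_features='sqrt',
                                    bootstrap=True)
    forest.fit(X_train, y_train)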

Hope this was helpful!

Edit:

Example code for setting up nested CV in scikit-learn:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.grid_search import GridSearchCV
    from sklearn.cross_validation import cross_val_score

    pipe_svc = Pipeline([('scl', StandardScaler()),
                         ('clf', SVC(random_state=1))])

    param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

    param_grid = [{'clf__C': param_range,
                   'clf__kernel': ['linear']},
                  {'clf__C': param_range,
                   'clf__gamma': param_range,
                   'clf__kernel': ['rbf']}]

    # Nested cross-validation (here: 5 x 2 cross-validation)
    # =======================================================
    gs = GridSearchCV(estimator=pipe_svc,
                      param_grid=param_grid,
                      scoring='accuracy',
                      cv=5)                              # inner loop: 5-fold grid search
    scores = cross_val_score(gs, X_train, y_train,
                             scoring='accuracy', cv=2)   # outer loop: 2 folds
    print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Source: https://habr.com/ru/post/1236935/

