Plotting a Precision-Recall curve when using cross-validation in scikit-learn

Question

I use cross-validation to evaluate classifier performance with scikit-learn, and I want to plot a Precision-Recall curve. I found an example on the scikit-learn website that plots a PR curve, but it does not use cross-validation for the evaluation.

How can I plot a Precision-Recall curve in scikit-learn when using cross-validation?

I did the following, but I'm not sure if this is the right way to do it (pseudocode):

    for each k-fold:
        precision, recall, _ = precision_recall_curve(y_test, probs)
        mean_precision += precision
        mean_recall += recall
    mean_precision /= num_folds
    mean_recall /= num_folds
    plt.plot(recall, precision)

What do you think?

Edit:

It does not work, because the precision and recall arrays have a different size after each fold.

Any ideas?

python scikit-learn

Jack twain Oct 27 '14 at 12:38

2 answers

David shih · Answer 1 · 2014-12-08T20:50:07+0000

Instead of recording the precision and recall values after each fold, store the predictions on the test samples after each fold. Then collect all the test (i.e., out-of-bag) predictions and compute precision and recall from them.

    ## let test_samples[k] = test samples for the kth fold (list of lists)
    ## let train_samples[k] = training samples for the kth fold (list of lists)
    predictions_fold = [None] * num_folds
    for k in range(num_folds):
        model = train(parameters, train_samples[k])
        predictions_fold[k] = predict(model, test_samples[k])

    # collect predictions
    predictions_combined = [p for preds in predictions_fold for p in preds]

    ## let predictions = rearranged predictions s.t. they are in the original order
    ## use predictions and labels to compute lists of TP, FP, FN
    ## use TP, FP, FN to compute precisions and recalls for one run of k-fold cross-validation

In one complete run of k-fold cross-validation, the predictor makes one and only one prediction for each sample. Given n samples, you should have n test predictions.

(Note: these predictions differ from training predictions, because the predictor makes each prediction for a sample it has not seen during training.)
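The pooling approach described above can be sketched with scikit-learn alone. This is a minimal illustration, not the answerer's code: the synthetic dataset, the `LogisticRegression` classifier, and the 5-fold split are all assumptions chosen to keep the example self-contained. `cross_val_predict` returns one out-of-fold score per sample, already in the original sample order.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, imbalanced binary data (an assumption, for illustration only).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Each sample is scored exactly once, by the model trained on the other folds.
probs = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

# A single PR curve computed from all n pooled out-of-fold scores.
precision, recall, thresholds = precision_recall_curve(y, probs)
```

The curve can then be drawn with `plt.plot(recall, precision)`. Because there is only one pair of arrays, the size mismatch between folds never arises.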

Unless you use leave-one-out cross-validation, k-fold cross-validation generally requires a random partitioning of the data. Ideally, you would do repeated (and stratified) k-fold cross-validation. However, combining precision-recall curves from different rounds is not straightforward, because, unlike with ROC, you cannot use simple linear interpolation between precision-recall points (see Davis and Goadrich 2006).

I personally calculated AUC-PR using the Davis-Goadrich method for interpolation in PR space (followed by numerical integration) and compared the classifiers using the AUC-PR estimates from repeated stratified 10-fold cross-validation.
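A rough sketch of the repeated stratified 10-fold procedure follows. Note the substitution: instead of Davis-Goadrich interpolation, it summarizes each repeat with scikit-learn's `average_precision_score`, which is a step-wise (non-interpolated) summary of the PR curve. The dataset, classifier, and number of repeats are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, imbalanced binary data (an assumption, for illustration only).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)
clf = LogisticRegression(max_iter=1000)

n_repeats = 5  # assumed value; more repeats give a steadier estimate
ap_scores = []
for repeat in range(n_repeats):
    # A different shuffle per repeat gives a different stratified partition.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=repeat)
    probs = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
    # One summary score per repeat, from the pooled out-of-fold predictions.
    ap_scores.append(average_precision_score(y, probs))

ap_scores = np.array(ap_scores)
print(f"mean AP = {ap_scores.mean():.3f}, std = {ap_scores.std():.3f}")
```

Running the same loop for each classifier gives one distribution of scores per classifier, which can then be compared.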

For a nice plot, I showed a representative PR curve from one of the cross-validation rounds.

Of course, there are many other ways to evaluate the performance of a classifier, depending on the nature of your data set.

For example, if the proportion of (binary) labels in your dataset is not skewed (i.e., approximately 50-50), you can use a simpler ROC analysis with cross-validation:

Collect the predictions from each fold and construct the ROC curves (as before), collect all the TPR-FPR points (i.e., take the union of all the TPR-FPR tuples), and then plot the combined set of points, possibly with smoothing. Optionally, calculate AUC-ROC using simple linear interpolation and the composite trapezoidal rule for numerical integration.
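One possible reading of this ROC recipe is sketched below: a ROC curve per fold, the union of all (FPR, TPR) points sorted by FPR, and a trapezoidal AUC over the combined points. The balanced synthetic dataset and the classifier are assumptions, and no smoothing is applied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

# Synthetic, roughly balanced binary data (an assumption, for illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fpr_parts, tpr_parts = [], []
for train_idx, test_idx in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    probs = clf.predict_proba(X[test_idx])[:, 1]
    fpr, tpr, _ = roc_curve(y[test_idx], probs)  # ROC curve for this fold
    fpr_parts.append(fpr)
    tpr_parts.append(tpr)

# Union of all per-fold points, ordered by FPR (ties broken by TPR).
fpr_all = np.concatenate(fpr_parts)
tpr_all = np.concatenate(tpr_parts)
order = np.lexsort((tpr_all, fpr_all))
fpr_all, tpr_all = fpr_all[order], tpr_all[order]

# Linear interpolation between points plus the composite trapezoidal rule.
roc_auc = auc(fpr_all, tpr_all)
```

The combined points are not guaranteed to be monotone in TPR, which is where the optional smoothing mentioned above would come in.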

Reii nakano · Answer 2 · 2017-02-24T10:40:31+0000

This is currently the best way to plot a Precision-Recall curve for an sklearn classifier using cross-validation. The best part is that it plots the PR curves for ALL classes, so you get some neat-looking curves as well.

    from sklearn.linear_model import LogisticRegression
    from scikitplot.classifiers import plot_precision_recall_curve
    import matplotlib.pyplot as plt

    clf = LogisticRegression()
    plot_precision_recall_curve(clf, X, y)
    plt.show()

The function automatically cross-validates on the given dataset, concatenates all the out-of-fold predictions, and computes the PR curves for each class plus an averaged PR curve. It is a one-line function that takes care of all of this for you.
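For readers who want to see those steps spelled out, here is a sketch using only scikit-learn. It is not scikit-plot's implementation; the iris dataset, the 5-fold split, and the choice of a micro-average as the "averaged" curve are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
classes = [0, 1, 2]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Pooled out-of-fold class probabilities, shape (n_samples, n_classes).
probs = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")

# One-vs-rest indicator matrix, one column per class.
y_onehot = label_binarize(y, classes=classes)

# One PR curve per class.
curves = {}
for i, label in enumerate(classes):
    precision, recall, _ = precision_recall_curve(y_onehot[:, i], probs[:, i])
    curves[label] = (precision, recall)

# Micro-averaged curve: treat every (sample, class) pair as one binary decision.
micro_precision, micro_recall, _ = precision_recall_curve(
    y_onehot.ravel(), probs.ravel()
)
```

Each entry of `curves` can be plotted with `plt.plot(recall, precision)`, alongside the micro-averaged curve.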

[Figure: precision-recall curves]

Disclaimer: note that this uses the scikit-plot library that I created.
