GridSearchCV evaluation parameter: using scoring = 'f1' or scoring = None (accuracy by default) gives the same result

I am using an example taken from the book “Mastering Machine Learning with scikit-learn”.

It uses a decision tree to predict whether each image on a web page is an advertisement or article content. Images that are classified as advertisements could then be hidden with cascading style sheets. The data is publicly available in the Internet Advertisements Data Set (http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements), which contains data for 3,279 images.
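For context on the class balance, which matters for the metric discussion below, a quick check is to count the labels in the last column of the raw file; the 'ad.' images are a clear minority. A minimal sketch, assuming the file sits at ad-dataset/ad.data as in the code below:

import pandas as pd

# Load the raw CSV; the last column holds the label ('ad.' or 'nonad.')
df = pd.read_csv('ad-dataset/ad.data', header=None)
labels = df[df.columns[-1]]

# How many images fall into each class, and the class proportions
print(labels.value_counts())
print(labels.value_counts(normalize=True))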

The following is the complete code for performing the classification task:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
import sys,random

def main(argv):
    df = pd.read_csv('ad-dataset/ad.data', header=None)
    explanatory_variable_columns = set(df.columns.values)
    # The last column of the file holds the label ('ad.' or 'nonad.')
    response_variable_column = df[len(df.columns.values)-1]

    explanatory_variable_columns.remove(len(df.columns.values)-1)
    # Encode the label: 1 for advertisements, 0 for article content
    y = [1 if e == 'ad.' else 0 for e in response_variable_column]
    X = df[list(explanatory_variable_columns)]

    # Missing values are marked with '?' in the raw file; replace them with -1
    X.replace(to_replace=' *\?', value=-1, regex=True, inplace=True)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100000)

    pipeline = Pipeline([('clf',DecisionTreeClassifier(criterion='entropy',random_state=20000))])

    parameters = {
        'clf__max_depth': (150, 155, 160),
        'clf__min_samples_split': (1, 2, 3),
        'clf__min_samples_leaf': (1, 2, 3)
    }

    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,verbose=1, scoring='f1')
    grid_search.fit(X_train, y_train)
    print 'Best score: %0.3f' % grid_search.best_score_
    print 'Best parameters set:'
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print '\t%s: %r' % (param_name, best_parameters[param_name])

    predictions = grid_search.predict(X_test)
    print classification_report(y_test, predictions)


if __name__ == '__main__':
  main(sys.argv[1:])
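A note on the default: with scoring=None, GridSearchCV falls back to the estimator's own score method, and for classifiers such as DecisionTreeClassifier that method returns mean accuracy. A minimal sketch on synthetic data (using the current sklearn.model_selection module path) to confirm the two numbers coincide:

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem, only to illustrate what scoring=None evaluates
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# classifier.score(...) is mean accuracy, i.e. the default grid-search metric
print(clf.score(X_te, y_te))
print(accuracy_score(y_te, clf.predict(X_te)))  # same value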

With scoring = 'f1' in GridSearchCV, the results are:

[F1 score results]

With scoring = None (i.e. the accuracy measure), the results are the same as with F1:

[Accuracy score results]

That surprised me, so I also tried scoring = 'precision'.

With scoring = 'precision' the result is different from the other two, so precision behaves the way I would expect:

[Precision score results]

Why do 'F1' and the default ('accuracy') give exactly the same result? Is this expected, or am I doing something wrong?

EDITED

As suggested in the answers, the problem was the param_grid. However, I see the same behaviour in another example with a heavily imbalanced dataset (roughly 100:1), where I compensate for the imbalance with class_weight; there, too, the 'F1' and default scores come out identical.

The param_grid is:

parameters = {"penalty": ("l1", "l2"),
    "C": (0.001, 0.01, 0.1, 1, 10, 100),
    "solver": ("newton-cg", "lbfgs", "liblinear"),
    "class_weight":[{0:4}],
}

Nevertheless, the results are still the same for both scorings.
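Assuming this grid is meant for a LogisticRegression (penalty, C, solver and class_weight are its parameters; the estimator itself is not shown here), note that not every solver supports every penalty: 'newton-cg' and 'lbfgs' only handle 'l2', so the 'l1' combinations in the grid above will raise errors. A hedged sketch of the same comparison on synthetic imbalanced data, using a list of compatible sub-grids:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the real (not shown) dataset;
# class 0 is the minority here, matching the class_weight={0: 4} choice above
X, y = make_classification(n_samples=2000, weights=[0.1, 0.9], random_state=0)

# A list of sub-grids avoids unsupported solver/penalty combinations
parameters = [
    {"penalty": ["l2"], "C": [0.001, 0.01, 0.1, 1, 10, 100],
     "solver": ["newton-cg", "lbfgs", "liblinear"], "class_weight": [{0: 4}]},
    {"penalty": ["l1"], "C": [0.001, 0.01, 0.1, 1, 10, 100],
     "solver": ["liblinear"], "class_weight": [{0: 4}]},
]

for scoring in ("f1", "accuracy"):
    gs = GridSearchCV(LogisticRegression(max_iter=1000), parameters,
                      scoring=scoring, cv=5)
    gs.fit(X, y)
    print(scoring, round(gs.best_score_, 3), gs.best_params_)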


First, a remark on your grid: min_samples_split=1 is a strange value, because a node that contains a single sample cannot be split anyway, so min_samples_split=1 behaves exactly like min_samples_split=2 and that part of the grid adds nothing.

From the documentation: min_samples_split: "The minimum number of samples required to split an internal node."

Btw., since the dataset is imbalanced, accuracy and F1 measure quite different things, so it is worth being explicit about which one you want to optimise.
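To make the imbalance point concrete: on skewed data, a classifier that always predicts the majority class already gets a high accuracy but an F1 of zero for the minority class, so the two metrics reward different behaviour. A small sketch on synthetic data with an imbalance in the same spirit as the ad dataset:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Roughly 6:1 imbalance
X, y = make_classification(n_samples=3000, weights=[0.86, 0.14], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Always predicts the majority class (0)
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
pred = dummy.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))  # close to the majority share
print("f1 (class 1):", f1_score(y_te, pred))    # 0.0, the ads are never found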

Now to the actual question. When I run your GridSearch (a) with scoring='f1' and (b) with scoring=None, I also get the same result in both cases, and I think the grid itself is the reason: with max_depth values of 150 and above the trees are grown far deeper than this data needs, so the candidate models barely differ from one another and both scorings end up ranking them the same way (hence the apparent "coincidence").

If you instead search over a range of depths where the parameter actually matters, for example

parameters = {
    'clf__max_depth': list(range(2, 30)),
    'clf__min_samples_split': (2,),
    'clf__min_samples_leaf': (1,)
}

"" F1 15.

Best score: 0.878
Best parameters set:
    clf__max_depth: 15
    clf__min_samples_leaf: 1
    clf__min_samples_split: 2
             precision    recall  f1-score   support

          0       0.98      0.99      0.99       716
          1       0.92      0.89      0.91       104

avg / total       0.98      0.98      0.98       820

"" ( None) :

Best score: 0.967
Best parameters set:
    clf__max_depth: 6
    clf__min_samples_leaf: 1
    clf__min_samples_split: 2
             precision    recall  f1-score   support

          0       0.98      0.99      0.98       716
          1       0.93      0.85      0.88       104

avg / total       0.97      0.97      0.97       820

So, as you can see, the model that is "best" according to F1 is not the same as the model that is "best" according to accuracy.
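As a side note, newer scikit-learn versions (0.19 and later, where GridSearchCV lives in sklearn.model_selection instead of the deprecated sklearn.grid_search) can evaluate several scorers in a single search, which makes this kind of comparison easier. A sketch, assuming X_train and y_train are prepared as in the question:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Same pipeline as in the question
pipeline = Pipeline([('clf', DecisionTreeClassifier(criterion='entropy', random_state=20000))])

grid_search = GridSearchCV(
    pipeline,
    {'clf__max_depth': list(range(2, 30)),
     'clf__min_samples_split': (2,),
     'clf__min_samples_leaf': (1,)},
    scoring={'f1': 'f1', 'accuracy': 'accuracy'},  # several metrics at once
    refit='f1',  # best_estimator_ is refitted according to the F1 ranking
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

# cv_results_ gets one set of columns per scorer
for depth, f1, acc in zip(grid_search.cv_results_['param_clf__max_depth'],
                          grid_search.cv_results_['mean_test_f1'],
                          grid_search.cv_results_['mean_test_accuracy']):
    print(depth, round(f1, 3), round(acc, 3))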


I ran into the same thing. In my case (also with an imbalanced dataset) the F1 score reported by the search was essentially equal to the accuracy.

Make sure you know exactly which variant of F1 GridSearchCV is computing: depending on how it is averaged over the classes, the F1 value can coincide with accuracy.
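For reference, the built-in scoring strings cover several F1 variants ('f1', 'f1_macro', 'f1_micro', 'f1_weighted'), and for single-label problems the micro-averaged F1 is by construction equal to accuracy, so that particular variant would always coincide with the default scoring. A small check on synthetic data:

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pred = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)

print(f1_score(y_te, pred))                   # 'f1': F1 of the positive class only
print(f1_score(y_te, pred, average='macro'))  # 'f1_macro': unweighted mean over classes
print(f1_score(y_te, pred, average='micro'))  # 'f1_micro': equals accuracy here
print(accuracy_score(y_te, pred))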


In an unbalanced dataset, use the labels parameter of the f1_score scorer so that only the F1 of the class you are interested in is taken into account. Or consider using sample_weight.
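A hedged sketch of what that could look like: f1_score accepts a labels argument, make_scorer can wrap it into something GridSearchCV understands, and most metrics also accept sample_weight. The dataset and the weights used below are synthetic placeholders:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)

# Score only the minority class (label 1); averaging over that single label
# simply returns its F1
f1_class1 = make_scorer(f1_score, labels=[1], average='macro')

gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {'max_depth': list(range(2, 10))},
                  scoring=f1_class1, cv=5)
gs.fit(X, y)
print(gs.best_params_, round(gs.best_score_, 3))

# Alternatively, weight individual samples when computing the metric directly,
# e.g. give the minority samples five times the weight (an arbitrary choice)
w = np.where(np.asarray(y) == 1, 5.0, 1.0)
print(f1_score(y, gs.predict(X), sample_weight=w))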


Source: https://habr.com/ru/post/1609808/

