GridSearchCV evaluation parameter: using scoring = 'f1' or scoring = None (accuracy by default) gives the same result

I am using an example taken from the book “Mastering Machine Learning with scikit-learn”.

It uses a decision tree to predict whether each image on a web page is an advertisement or article content. Images that are classified as advertisements could then be hidden with cascading style sheets. The data is publicly available in the Internet Advertisements Data Set (http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements), which contains data for 3,279 images.
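For context on the class balance, which matters for the metric discussion below, a quick check is to count the labels in the last column of the raw file; the 'ad.' images are a clear minority. A minimal sketch, assuming the file sits at ad-dataset/ad.data as in the code below:

import pandas as pd

# Load the raw CSV; the last column holds the label ('ad.' or 'nonad.')
df = pd.read_csv('ad-dataset/ad.data', header=None)
labels = df[df.columns[-1]]

# How many images fall into each class, and the class proportions
print(labels.value_counts())
print(labels.value_counts(normalize=True))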

The following is the complete code for performing the classification task:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
import sys,random

def main(argv):
    df = pd.read_csv('ad-dataset/ad.data', header=None)
    explanatory_variable_columns = set(df.columns.values)
    # The last column of the file holds the label ('ad.' or 'nonad.')
    response_variable_column = df[len(df.columns.values)-1]

    explanatory_variable_columns.remove(len(df.columns.values)-1)
    # Encode the label: 1 for advertisements, 0 for article content
    y = [1 if e == 'ad.' else 0 for e in response_variable_column]
    X = df[list(explanatory_variable_columns)]

    # Missing values are marked with '?' in the raw file; replace them with -1
    X.replace(to_replace=' *\?', value=-1, regex=True, inplace=True)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100000)

    pipeline = Pipeline([('clf',DecisionTreeClassifier(criterion='entropy',random_state=20000))])

    parameters = {
        'clf__max_depth': (150, 155, 160),
        'clf__min_samples_split': (1, 2, 3),
        'clf__min_samples_leaf': (1, 2, 3)
    }

    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,verbose=1, scoring='f1')
    grid_search.fit(X_train, y_train)
    print 'Best score: %0.3f' % grid_search.best_score_
    print 'Best parameters set:'
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print '\t%s: %r' % (param_name, best_parameters[param_name])

    predictions = grid_search.predict(X_test)
    print classification_report(y_test, predictions)


if __name__ == '__main__':
  main(sys.argv[1:])
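A note on the default: with scoring=None, GridSearchCV falls back to the estimator's own score method, and for classifiers such as DecisionTreeClassifier that method returns mean accuracy. A minimal sketch on synthetic data (using the current sklearn.model_selection module path) to confirm the two numbers coincide:

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem, only to illustrate what scoring=None evaluates
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# classifier.score(...) is mean accuracy, i.e. the default grid-search metric
print(clf.score(X_te, y_te))
print(accuracy_score(y_te, clf.predict(X_te)))  # same value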

With scoring = 'f1' in GridSearchCV, the results are:

[F1 score results]

With scoring = None (i.e. the accuracy measure), the results are the same as with F1:

[Accuracy score results]

That surprised me, so I also tried scoring = 'precision'.

With scoring = 'precision' the result is different from the other two, so precision behaves the way I would expect:

[Precision score results]

Why do 'F1' and the default ('accuracy') give exactly the same result? Is this expected, or am I doing something wrong?

EDITED

As suggested in the answers, the problem was the param_grid. However, I see the same behaviour in another example with a heavily imbalanced dataset (roughly 100:1), where I compensate for the imbalance with class_weight; there, too, the 'F1' and default scores come out identical.

The param_grid is:

parameters = {"penalty": ("l1", "l2"),
    "C": (0.001, 0.01, 0.1, 1, 10, 100),
    "solver": ("newton-cg", "lbfgs", "liblinear"),
    "class_weight":[{0:4}],
}

Nevertheless, the results are still the same for both scorings.
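Assuming this grid is meant for a LogisticRegression (penalty, C, solver and class_weight are its parameters; the estimator itself is not shown here), note that not every solver supports every penalty: 'newton-cg' and 'lbfgs' only handle 'l2', so the 'l1' combinations in the grid above will raise errors. A hedged sketch of the same comparison on synthetic imbalanced data, using a list of compatible sub-grids:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the real (not shown) dataset;
# class 0 is the minority here, matching the class_weight={0: 4} choice above
X, y = make_classification(n_samples=2000, weights=[0.1, 0.9], random_state=0)

# A list of sub-grids avoids unsupported solver/penalty combinations
parameters = [
    {"penalty": ["l2"], "C": [0.001, 0.01, 0.1, 1, 10, 100],
     "solver": ["newton-cg", "lbfgs", "liblinear"], "class_weight": [{0: 4}]},
    {"penalty": ["l1"], "C": [0.001, 0.01, 0.1, 1, 10, 100],
     "solver": ["liblinear"], "class_weight": [{0: 4}]},
]

for scoring in ("f1", "accuracy"):
    gs = GridSearchCV(LogisticRegression(max_iter=1000), parameters,
                      scoring=scoring, cv=5)
    gs.fit(X, y)
    print(scoring, round(gs.best_score_, 3), gs.best_params_)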


First, a remark on your grid: min_samples_split=1 is a strange value, because a node that contains a single sample cannot be split anyway, so min_samples_split=1 behaves exactly like min_samples_split=2 and that part of the grid adds nothing.

From the documentation: min_samples_split: "The minimum number of samples required to split an internal node."

Btw., since the dataset is imbalanced, accuracy and F1 measure quite different things, so it is worth being explicit about which one you want to optimise.
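To make the imbalance point concrete: on skewed data, a classifier that always predicts the majority class already gets a high accuracy but an F1 of zero for the minority class, so the two metrics reward different behaviour. A small sketch on synthetic data with an imbalance in the same spirit as the ad dataset:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Roughly 6:1 imbalance
X, y = make_classification(n_samples=3000, weights=[0.86, 0.14], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Always predicts the majority class (0)
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
pred = dummy.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))  # close to the majority share
print("f1 (class 1):", f1_score(y_te, pred))    # 0.0, the ads are never found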

Now to the actual question. When I run your GridSearch (a) with scoring='f1' and (b) with scoring=None, I also get the same result in both cases, and I think the grid itself is the reason: with max_depth values of 150 and above the trees are grown far deeper than this data needs, so the candidate models barely differ from one another and both scorings end up ranking them the same way (hence the apparent "coincidence").

If you instead search over a range of depths where the parameter actually matters, for example

parameters = {
    'clf__max_depth': list(range(2, 30)),
    'clf__min_samples_split': (2,),
    'clf__min_samples_leaf': (1,)
}

"" F1 15.

Best score: 0.878
Best parameters set:
    clf__max_depth: 15
    clf__min_samples_leaf: 1
    clf__min_samples_split: 2
             precision    recall  f1-score   support

          0       0.98      0.99      0.99       716
          1       0.92      0.89      0.91       104

avg / total       0.98      0.98      0.98       820

"" ( None) :

Best score: 0.967
Best parameters set:
    clf__max_depth: 6
    clf__min_samples_leaf: 1
    clf__min_samples_split: 2
             precision    recall  f1-score   support

          0       0.98      0.99      0.98       716
          1       0.93      0.85      0.88       104

avg / total       0.97      0.97      0.97       820

So, as you can see, the model that is "best" according to F1 is not the same as the model that is "best" according to accuracy.
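As a side note, newer scikit-learn versions (0.19 and later, where GridSearchCV lives in sklearn.model_selection instead of the deprecated sklearn.grid_search) can evaluate several scorers in a single search, which makes this kind of comparison easier. A sketch, assuming X_train and y_train are prepared as in the question:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Same pipeline as in the question
pipeline = Pipeline([('clf', DecisionTreeClassifier(criterion='entropy', random_state=20000))])

grid_search = GridSearchCV(
    pipeline,
    {'clf__max_depth': list(range(2, 30)),
     'clf__min_samples_split': (2,),
     'clf__min_samples_leaf': (1,)},
    scoring={'f1': 'f1', 'accuracy': 'accuracy'},  # several metrics at once
    refit='f1',  # best_estimator_ is refitted according to the F1 ranking
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

# cv_results_ gets one set of columns per scorer
for depth, f1, acc in zip(grid_search.cv_results_['param_clf__max_depth'],
                          grid_search.cv_results_['mean_test_f1'],
                          grid_search.cv_results_['mean_test_accuracy']):
    print(depth, round(f1, 3), round(acc, 3))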


I ran into the same thing. In my case (also with an imbalanced dataset) the F1 score reported by the search was essentially equal to the accuracy.

Make sure you know exactly which variant of F1 GridSearchCV is computing: depending on how it is averaged over the classes, the F1 value can coincide with accuracy.
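For reference, the built-in scoring strings cover several F1 variants ('f1', 'f1_macro', 'f1_micro', 'f1_weighted'), and for single-label problems the micro-averaged F1 is by construction equal to accuracy, so that particular variant would always coincide with the default scoring. A small check on synthetic data:

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pred = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)

print(f1_score(y_te, pred))                   # 'f1': F1 of the positive class only
print(f1_score(y_te, pred, average='macro'))  # 'f1_macro': unweighted mean over classes
print(f1_score(y_te, pred, average='micro'))  # 'f1_micro': equals accuracy here
print(accuracy_score(y_te, pred))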


In an unbalanced dataset, use the labels parameter of the f1_score scorer so that only the F1 of the class you are interested in is taken into account. Or consider using sample_weight.
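A hedged sketch of what that could look like: f1_score accepts a labels argument, make_scorer can wrap it into something GridSearchCV understands, and most metrics also accept sample_weight. The dataset and the weights used below are synthetic placeholders:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)

# Score only the minority class (label 1); averaging over that single label
# simply returns its F1
f1_class1 = make_scorer(f1_score, labels=[1], average='macro')

gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {'max_depth': list(range(2, 10))},
                  scoring=f1_class1, cv=5)
gs.fit(X, y)
print(gs.best_params_, round(gs.best_score_, 3))

# Alternatively, weight individual samples when computing the metric directly,
# e.g. give the minority samples five times the weight (an arbitrary choice)
w = np.where(np.asarray(y) == 1, 5.0, 1.0)
print(f1_score(y, gs.predict(X), sample_weight=w))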


Source: https://habr.com/ru/post/1609808/

