I'm having a lot of trouble understanding how the class_weight parameter in scikit-learn's Logistic Regression works.
Situation
I want to use logistic regression to perform binary classification on a very unbalanced dataset. The classes are labelled 0 (negative) and 1 (positive), and the observed data are in a ratio of about 19:1, with the majority of samples having a negative outcome.
First attempt: manually preparing training data
I split the data I had into disjoint sets for training and testing (about 80/20). Then I randomly subsampled the training data by hand to get training data in different proportions than 19:1, from 2:1 to 16:1.
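To make the subsampling step concrete, here is roughly what I did; a sketch with hypothetical labels, and subsample_to_ratio is just my own helper name, not a library function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels at roughly the 19:1 ratio described above.
y = np.array([0] * 190 + [1] * 10)

def subsample_to_ratio(y, ratio, rng):
    """Keep all positives and draw `ratio` negatives per positive (a sketch)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=ratio * len(pos), replace=False)
    return np.sort(np.concatenate([pos, keep_neg]))

idx = subsample_to_ratio(y, 4, rng)  # a 4:1 negative:positive subset
print(len(idx))  # 10 positives + 40 negatives -> 50
```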
Then I trained logistic regression on these different training-data subsets and plotted recall (= TP / (TP + FN)) as a function of the different training proportions. Of course, recall was computed on the disjoint TEST samples, which had the observed 19:1 proportions. Note that although I trained the different models on different training data, I computed recall for all of them on the same (disjoint) test data.
The results were as expected: recall was about 60% at 2:1 training proportions and fell off rather quickly by the time it reached 16:1. There were several proportions, 2:1 to 6:1, where recall was decently above 5%.
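For clarity, the recall I mean is just TP / (TP + FN); a minimal illustration on made-up labels:

```python
# Made-up true labels and predictions for a binary problem.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# True positives: actual 1 predicted as 1; false negatives: actual 1 predicted as 0.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

recall = tp / (tp + fn)
print(recall)  # 2 true positives out of 4 actual positives -> 0.5
```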
Second attempt: grid search
Next, I wanted to test different regularization parameters, so I used GridSearchCV and made a grid of several values of the C parameter, as well as of the class_weight parameter. To translate my n:m proportions of negative:positive training samples into the dictionary language of class_weight, I thought I would simply specify several dictionaries as follows:
{ 0:0.67, 1:0.33 }
and I also included None and auto.
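Concretely, my grid search looked roughly like this; a sketch on synthetic data, and note that newer scikit-learn versions spell the auto option "balanced":

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic unbalanced data standing in for my real dataset (~5% positives).
X, y = make_classification(n_samples=400, weights=[0.95], random_state=0)

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "class_weight": [{0: 0.67, 1: 0.33}, {0: 0.75, 1: 0.25}, None, "balanced"],
}

# Score each candidate by recall on the held-out folds.
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="recall", cv=3)
grid.fit(X, y)
print(grid.best_params_)
```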
This time, the results were completely off. All my recalls came out tiny (< 0.05) for every value of class_weight except auto. So I can only assume that my understanding of how to set the class_weight dictionary is wrong. Interestingly, the recall for the class_weight value of auto in the grid search was around 59% for all values of C, and I guessed that it balances to 1:1?
My questions
1) How do you properly use class_weight to achieve different balances in the training data from what you actually give it? Specifically, what dictionary do I pass to class_weight to use n:m proportions of negative:positive training samples?
2) If you pass various class_weight dictionaries to GridSearchCV, will it, during cross-validation, rebalance the training-fold data according to the dictionary but use the true given proportions for computing my scoring function on the test fold? This is critical, since any metric is only useful to me if it comes from data in the observed proportions.
3) What does the auto value of class_weight do with respect to proportions? I read the documentation, and I assume that "balances the data inversely proportional to their frequency" just means it makes it 1:1. Is that correct? If not, can someone clarify?
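In case it helps frame question 3: my reading of the docs is that the auto heuristic (spelled "balanced" in newer versions) weights class c by n_samples / (n_classes * count_c), which for my 19:1 data would give both classes equal total weight, i.e. effectively 1:1. A small sketch of that formula, my own reconstruction rather than the library's code:

```python
from collections import Counter

# Hypothetical 19:1 labels, mirroring the ratio described above.
y = [0] * 19 + [1] * 1
counts = Counter(y)
n_samples, n_classes = len(y), len(counts)

# My reading of the "auto"/"balanced" heuristic:
# weight_c = n_samples / (n_classes * count_c)
weights = {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}
print(weights)  # {0: 0.526..., 1: 10.0}
```

With these weights, 19 negatives x 0.526 = 10 = 1 positive x 10, so both classes carry the same total weight, which is what makes me guess it amounts to 1:1.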
Thank you very much, any clarifications would be greatly appreciated!