I'm having a lot of trouble understanding how the class_weight parameter in scikit-learn's Logistic Regression works.
Situation
I want to use logistic regression to perform binary classification on a very unbalanced dataset. The classes are labelled 0 (negative) and 1 (positive), and the observed data are in a ratio of about 19:1, with the majority of samples having a negative outcome.
First attempt: manually preparing training data
I split the data I had into disjoint sets for training and testing (about 80/20). Then I randomly subsampled the training data by hand to get training data in different proportions than 19:1, from 2:1 to 16:1.
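To make the subsampling step concrete, here is roughly what I did; a sketch with hypothetical labels, and subsample_to_ratio is just my own helper name, not a library function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels at roughly the 19:1 ratio described above.
y = np.array([0] * 190 + [1] * 10)

def subsample_to_ratio(y, ratio, rng):
    """Keep all positives and draw `ratio` negatives per positive (a sketch)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=ratio * len(pos), replace=False)
    return np.sort(np.concatenate([pos, keep_neg]))

idx = subsample_to_ratio(y, 4, rng)  # a 4:1 negative:positive subset
print(len(idx))  # 10 positives + 40 negatives -> 50
```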
Then I trained logistic regression on these different training-data subsets and plotted recall (= TP / (TP + FN)) as a function of the different training proportions. Of course, recall was computed on the disjoint TEST samples, which had the observed 19:1 proportions. Note that although I trained the different models on different training data, I computed recall for all of them on the same (disjoint) test data.
The results were as expected: recall was about 60% at 2:1 training proportions and fell off rather quickly by the time it reached 16:1. There were several proportions, 2:1 to 6:1, where recall was decently above 5%.
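For clarity, the recall I mean is just TP / (TP + FN); a minimal illustration on made-up labels:

```python
# Made-up true labels and predictions for a binary problem.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# True positives: actual 1 predicted as 1; false negatives: actual 1 predicted as 0.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

recall = tp / (tp + fn)
print(recall)  # 2 true positives out of 4 actual positives -> 0.5
```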
Second attempt: grid search
Next, I wanted to test different regularization parameters, so I used GridSearchCV and made a grid of several values of the C parameter, as well as of the class_weight parameter. To translate my n:m proportions of negative:positive training samples into the dictionary language of class_weight, I thought I would simply specify several dictionaries as follows:
{ 0:0.67, 1:0.33 }
and I also included None and auto.
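Concretely, my grid search looked roughly like this; a sketch on synthetic data, and note that newer scikit-learn versions spell the auto option "balanced":

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic unbalanced data standing in for my real dataset (~5% positives).
X, y = make_classification(n_samples=400, weights=[0.95], random_state=0)

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "class_weight": [{0: 0.67, 1: 0.33}, {0: 0.75, 1: 0.25}, None, "balanced"],
}

# Score each candidate by recall on the held-out folds.
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="recall", cv=3)
grid.fit(X, y)
print(grid.best_params_)
```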
This time, the results were completely off. All my recalls came out tiny (< 0.05) for every value of class_weight except auto. So I can only assume that my understanding of how to set the class_weight dictionary is wrong. Interestingly, the recall for the class_weight value of auto in the grid search was around 59% for all values of C, and I guessed that it balances to 1:1?
My questions
1) How do you properly use class_weight to achieve different balances in the training data from what you actually give it? Specifically, what dictionary do I pass to class_weight to use n:m proportions of negative:positive training samples?
2) If you pass various class_weight dictionaries to GridSearchCV, will it, during cross-validation, rebalance the training-fold data according to the dictionary but use the true given proportions for computing my scoring function on the test fold? This is critical, since any metric is only useful to me if it comes from data in the observed proportions.
3) What does the auto value of class_weight do with respect to proportions? I read the documentation, and I assume that "balances the data inversely proportional to their frequency" just means it makes it 1:1. Is that correct? If not, can someone clarify?
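In case it helps frame question 3: my reading of the docs is that the auto heuristic (spelled "balanced" in newer versions) weights class c by n_samples / (n_classes * count_c), which for my 19:1 data would give both classes equal total weight, i.e. effectively 1:1. A small sketch of that formula, my own reconstruction rather than the library's code:

```python
from collections import Counter

# Hypothetical 19:1 labels, mirroring the ratio described above.
y = [0] * 19 + [1] * 1
counts = Counter(y)
n_samples, n_classes = len(y), len(counts)

# My reading of the "auto"/"balanced" heuristic:
# weight_c = n_samples / (n_classes * count_c)
weights = {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}
print(weights)  # {0: 0.526..., 1: 10.0}
```

With these weights, 19 negatives x 0.526 = 10 = 1 positive x 10, so both classes carry the same total weight, which is what makes me guess it amounts to 1:1.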
Thank you very much, any clarifications would be greatly appreciated!