TensorFlow classification with an extremely unbalanced dataset

I am using TensorFlow's LinearClassifier, and also a DNN, to classify a two-class dataset.

However, the problem is that the dataset contains 96% positive examples and only 4% negative ones, and my program always predicts positive. Of course, this way I achieve 96% accuracy, but the result is meaningless.

What is a good way to handle this situation?

+5
4 answers

You can try changing the cost function so that a false positive (predicting positive on one of the rare negative examples) is penalized more heavily than a false negative.
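A minimal sketch of this idea, using plain NumPy rather than any specific TensorFlow API: the function name `weighted_log_loss` and the weight ratio (roughly 96/4 ≈ 24, so both classes contribute equally) are illustrative choices, not from the original answer.

```python
import numpy as np

def weighted_log_loss(y_true, y_pred, w_neg=24.0, w_pos=1.0, eps=1e-7):
    """Log loss where errors on the rare negative class cost more.

    With 96% positives and 4% negatives, a weight ratio around
    96 / 4 = 24 makes the two classes contribute equally on average.
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # weight each example by the cost attached to its true class
    weights = np.where(y_true == 1, w_pos, w_neg)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(weights * losses))

y_true = np.array([1, 1, 1, 0])
y_pred = np.array([0.99, 0.99, 0.99, 0.9])  # "always positive" model
# the single misclassified negative now dominates the loss
print(weighted_log_loss(y_true, y_pred))
```

The same weighting can be applied in TensorFlow by passing per-example weights into the loss; the point is only that the gradient signal from the 4% class stops being drowned out.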

+4

Here are some of the simplest things you can try:

  • You can create minibatches that sample the two classes equally, and then recalibrate the model's output probabilities at test time.
  • You can reweight the examples to upweight the negatives.
  • You can use hinge loss instead of log loss, which may be more robust to unbalanced data, since it receives no gradient once an example is correctly classified outside the margin.
  • You can explore other loss functions that penalize the different types of errors asymmetrically.
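The first bullet, balanced minibatch sampling, can be sketched as follows; `balanced_minibatch` and the batch size are illustrative names, not from the answer, and the sketch uses NumPy instead of a TensorFlow input pipeline.

```python
import numpy as np

def balanced_minibatch(X, y, batch_size=8, rng=None):
    """Sample a minibatch with equal numbers of positives and negatives.

    Draws batch_size // 2 indices from each class with replacement, so
    the rare negative class appears as often as the common positive one.
    """
    rng = np.random.default_rng(rng)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    half = batch_size // 2
    idx = np.concatenate([
        rng.choice(pos_idx, half, replace=True),
        rng.choice(neg_idx, half, replace=True),
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

# 96 positives, 4 negatives, one feature per example
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.array([1] * 96 + [0] * 4)
Xb, yb = balanced_minibatch(X, y, batch_size=8, rng=0)
print(yb.mean())  # 0.5: half of every batch is negative
```

Because training then sees a 50/50 class prior rather than the true 96/4 one, the predicted probabilities need to be recalibrated back toward the real prior at test time, as the answer notes.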
+3

You can train an autoencoder on the negative examples you have (if there are enough of them), and then generate new examples with a sampling method such as variational Bayes or Markov chain Monte Carlo. This way you can increase the number of negative samples and move toward a more balanced dataset.
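A very crude NumPy sketch of the idea: a one-layer linear autoencoder trained on stand-in negative data, with new samples produced by perturbing latent codes and decoding. This is a simplification of my own; the answer proposes proper sampling (variational Bayes or MCMC), and all variable names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for the scarce negative examples: 40 samples, 5 features
X_neg = rng.normal(size=(40, 5))

# a one-layer linear autoencoder: encode to 2 dims, decode back to 5
W_enc = rng.normal(scale=0.1, size=(5, 2))
W_dec = rng.normal(scale=0.1, size=(2, 5))

lr = 0.01
for _ in range(500):
    Z = X_neg @ W_enc          # latent codes
    X_hat = Z @ W_dec          # reconstruction
    err = X_hat - X_neg        # reconstruction error
    # gradient steps on the mean squared reconstruction error
    W_dec -= lr * Z.T @ err / len(X_neg)
    W_enc -= lr * X_neg.T @ (err @ W_dec.T) / len(X_neg)

# generate synthetic negatives by perturbing latent codes and decoding,
# a crude stand-in for sampling with variational Bayes or MCMC
Z = X_neg @ W_enc
Z_new = Z + rng.normal(scale=0.1, size=Z.shape)
X_synth = Z_new @ W_dec
print(X_synth.shape)  # (40, 5): one synthetic negative per original
```

In practice you would repeat the sampling step many times to grow the negative class toward the size of the positive one.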

+2

You can check this paper for different sampling methods that mitigate the class imbalance problem: http://www.machinelearning.org/proceedings/icml2007/papers/62.pdf . Simple random oversampling of the minority class usually works well.
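Random oversampling of the minority class is a few lines of NumPy; the function name `random_oversample` is an illustrative choice, not from the paper.

```python
import numpy as np

def random_oversample(X, y, minority=0, rng=None):
    """Duplicate random minority-class rows until both classes are equal."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    # draw with replacement as many minority rows as there are majority rows
    extra = rng.choice(min_idx, size=len(maj_idx), replace=True)
    idx = np.concatenate([maj_idx, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.array([1] * 96 + [0] * 4)   # 96% positive, 4% negative
X_bal, y_bal = random_oversample(X, y, minority=0, rng=0)
print(y_bal.mean())  # 0.5: the resampled set is balanced
```

Oversample only the training split; duplicating minority rows before splitting would leak copies of the same example into the test set.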

0

Source: https://habr.com/ru/post/1239374/
