Data imbalance in SVM using libSVM

How do I adjust the gamma and cost parameters in libSVM when I use an unbalanced dataset consisting of 75% "true" tags and 25% "false" tags? Because of the imbalance, I consistently get the degenerate result that all predicted labels are set to "true".

If the problem is not with libSVM but with my dataset, how should I deal with this imbalance from a theoretical machine-learning perspective? The number of features I use is between 4 and 10, and I have a small set of 250 data points.

+6

3 answers

Class imbalance has nothing to do with the choice of C and gamma. To address it, you should use a class-weighting scheme, which is available, for example, in scikit-learn (built on libsvm).
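A minimal sketch of the class-weighting idea in scikit-learn. The data here is a synthetic stand-in for the setup in the question (roughly 250 points, a 75/25 split, a handful of features); with your own `X` and `y` only the `SVC(...)` line matters.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the 250-point, 75/25 imbalanced data set.
X = np.vstack([rng.normal(0.0, 1.0, (188, 6)),
               rng.normal(1.5, 1.0, (62, 6))])
y = np.array([1] * 188 + [0] * 62)  # 1 = "true" (majority), 0 = "false"

# class_weight="balanced" sets w_c = n_samples / (n_classes * n_c),
# so the minority class receives a proportionally larger penalty C * w_c.
clf = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")
clf.fit(X, y)
print(sorted(set(clf.predict(X))))
```

With the weighting in place, the classifier is no longer free to collapse onto the majority class, so both labels appear among the predictions.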

Choosing the best C and gamma is done with a cross-validated grid search. You should try a wide range of values here: for C it is reasonable to try values between 1 and 10^15, while a simple and good heuristic for the gamma range is to compute the pairwise distances between all your data points and select gamma according to percentiles of this distribution. Think of placing a Gaussian with variance 1/gamma at each point: if you choose gamma such that this distribution overlaps many points, you will get a very "smooth" model, while a small variance leads to overfitting.
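The recipe above can be sketched as follows. This is one plausible reading of the heuristic (taking the Gaussian width sigma from distance percentiles and setting gamma = 1/(2*sigma^2)); `X` and `y` are placeholder data.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels

d = pdist(X)  # all pairwise Euclidean distances
# Take the Gaussian width sigma from percentiles of the distance
# distribution, then convert to gamma = 1 / (2 * sigma^2).
sigmas = np.percentile(d, [10, 25, 50, 75, 90])
gammas = 1.0 / (2.0 * sigmas ** 2)

param_grid = {
    "C": np.logspace(0, 15, num=6),  # C between 1 and 10^15, log-spaced
    "gamma": gammas,
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

A log-spaced grid for C is the usual choice because the SVM objective is far more sensitive to C's order of magnitude than to its exact value.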

+6

Unbalanced datasets can be addressed in various ways. Class balance does not affect kernel parameters such as gamma for the RBF kernel.

The two most popular approaches are:

  • Use different misclassification penalties per class; this basically means changing C. As a rule, the smaller class is weighted higher, and a common approach is npos * wpos = nneg * wneg. LIBSVM allows you to do this through its -wX flags.
  • Subsample the over-represented class to obtain equal numbers of positive and negative examples, then proceed with training as you traditionally would on a balanced set. Note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
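The second option (random undersampling of the majority class) can be sketched in a few lines of NumPy; `X` and `y` stand in for your data, with 1 as the majority label:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 4))          # placeholder features
y = np.array([1] * 188 + [0] * 62)     # 75/25 imbalance as in the question

maj_idx = np.flatnonzero(y == 1)
min_idx = np.flatnonzero(y == 0)
# Keep only as many majority samples as there are minority samples.
keep = rng.choice(maj_idx, size=min_idx.size, replace=False)
sel = np.concatenate([keep, min_idx])

X_bal, y_bal = X[sel], y[sel]
print(np.bincount(y_bal))  # -> [62 62]
```

With only 250 points to begin with, note that this discards half the majority class, which is exactly the drawback mentioned above.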
+4

I know this was asked a while ago, but I would like to answer it, as you may find my answer helpful.

As already mentioned, you might consider using different weights for minority classes or using different penalties for incorrect classification. However, there is a smarter way to deal with unbalanced data sets.

You can use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic data for the minority class. It is a simple algorithm that handles many imbalanced data sets very well.

In each iteration, SMOTE considers two instances of the minority class and adds an artificial example of the same class somewhere in between them. The algorithm keeps augmenting the data set with such samples until the two classes are balanced or some other criterion is met (for example, a fixed number of examples has been added). Below is an image illustrating the algorithm on a simple data set in a 2D feature space.
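A toy sketch of SMOTE's core step (simplified from the original paper, which interpolates toward one of the k nearest minority neighbours; the function name and defaults here are my own):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority points."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from X_min[i] to every minority point.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()  # random position on the connecting segment
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(62, 4))        # the 25% minority class
X_new = smote_sketch(X_min, n_new=126)  # 62 + 126 = 188, matching the majority
print(X_new.shape)                      # -> (126, 4)
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the region the minority class already occupies.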

Associating a weight with the minority class is a special case of this algorithm: when you associate a weight $w_i$ with instance i, you basically add $w_i - 1$ extra copies of instance i on top of it!

[Image: SMOTE illustrated on a simple 2D data set]

  • What you need to do is augment the original dataset with the samples created by this algorithm and train the SVM on the new dataset. You can also find many implementations online in different languages, such as Python and Matlab.

  • There are other extensions of this algorithm; I can point you to more material on them if you want.

  • To test the classifier, you need to split the data set into train and test sets, add synthetic instances to the train set (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally evaluate it on the test set. If the generated instances leak into testing, you will get biased (and ridiculously high) accuracy and recall.
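The evaluation protocol from the last bullet can be sketched as follows: split first, oversample the training split only, and score on untouched test data. For brevity, simple duplication of minority points stands in for SMOTE here; the data is synthetic placeholder data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (188, 4)),
               rng.normal(2.0, 1.0, (62, 4))])
y = np.array([1] * 188 + [0] * 62)

# 1. Split BEFORE generating any synthetic samples.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 2. Oversample the minority class in the training split only
#    (duplication as a stand-in for SMOTE; the test split is untouched).
min_tr = X_tr[y_tr == 0]
n_extra = (y_tr == 1).sum() - (y_tr == 0).sum()
extra = min_tr[rng.integers(len(min_tr), size=n_extra)]
X_aug = np.vstack([X_tr, extra])
y_aug = np.concatenate([y_tr, np.zeros(n_extra, dtype=int)])

# 3. Train on the augmented train set, evaluate on the clean test set.
clf = SVC(kernel="rbf", gamma="scale").fit(X_aug, y_aug)
print(round(clf.score(X_te, y_te), 2))  # honest accuracy, no leakage
```

Doing the split first is the whole point: any synthetic point is derived from real points, so generating before splitting leaks information from the test set into training.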

+2

Source: https://habr.com/ru/post/954925/

