How to calculate the probability (confidence) of SVM classification for a small data set?

Use case:

I have a small data set with approximately 3-10 samples in each class. I use sklearn's SVC with an RBF kernel to classify them. I need a prediction confidence along with the predicted class, so I used SVC's predict_proba method. With this, I got strange results. I searched a bit and found out that this only makes sense for larger datasets.
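For concreteness, here is a minimal sketch of the setup described above, using an invented toy dataset of ~5 samples per class (the data and variable names are illustrative, not the asker's actual data):

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for the small dataset: two classes, 5 samples each.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 5 + [1] * 5)

# probability=True enables Platt scaling, which fits a sigmoid via
# internal cross-validation -- unreliable with this few samples.
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
proba = clf.predict_proba(X)
print(proba)  # rows sum to 1, but values are poorly calibrated here
```

Each row of `proba` sums to 1, but with so few samples the internal sigmoid fit has almost nothing to calibrate against, which is the source of the "strange results".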

Found this related Stack Overflow question: scikit-learn predict_proba gives wrong answers.

The author of that question confirmed this by multiplying the data set, i.e., duplicating it several times.

My questions:

1) If I multiply my data set, say by 100, repeating each sample 100 times, it improves the "correctness" of predict_proba. What side effects will this have? Overfitting?

2) Is there any other way to calculate the confidence of the classifier? For example, the distance from the hyperplane?

3) For this small sample size, is SVM the recommended algorithm, or should I choose something else?
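The duplication idea from question 1 can be sketched as follows, again on an invented toy dataset (names and data are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class dataset, 5 samples per class.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 5 + [1] * 5)

# Repeat every sample 100 times, as proposed in question 1.
X_big = np.repeat(X, 100, axis=0)
y_big = np.repeat(y, 100)

small = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
big = SVC(kernel="rbf", probability=True, random_state=0).fit(X_big, y_big)

# The duplicated set carries no new information; only the internal
# Platt sigmoid fit (and hence the reported probabilities) can shift.
print(small.predict_proba(X)[:1])
print(big.predict_proba(X)[:1])
```

Comparing the two outputs makes it easy to see how much of the probability estimate is an artifact of the calibration step rather than of the data itself.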

1 answer

First of all: your dataset seems very small for any practical purpose. Having said that, let's see what we can do.

SVMs are mainly popular in high-dimensional settings; it is currently unclear whether that applies to your project. They build their planes on a handful of (or even single) support instances, and are often outperformed by neural nets when training sets are large. A priori, they may not be your worst choice.

Oversampling your data will not help much when using an SVM. An SVM is based on the concept of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Duplicating samples will not create new support vectors (I assume you are already using the training set as the test set).

Plain oversampling in this scenario will also not give you any new information about certainty, other than artifacts created by unbalanced oversampling, since the instances will be exact copies and the distribution will not change. You may find some use in SMOTE (Synthetic Minority Oversampling Technique). It creates synthetic instances based on the ones you have; in theory this gives you new instances that are not exact copies, and classification may thus differ slightly from the plain-duplication case. Note: by definition, all these synthetic samples will lie between the original samples in your sample space. This does not mean they will lie between your data in the SVM-projected space, so they can possibly create training effects that are not actually there.
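To make the interpolation idea concrete, here is a minimal SMOTE-style sketch in plain numpy (this is not the imbalanced-learn implementation; the function name and parameters are invented for illustration). Each synthetic point is a convex combination of an original point and one of its nearest same-class neighbours, so it lies between originals in sample space, exactly as the caveat above warns:

```python
import numpy as np

def smote_like(X, n_new, k=2, rng=None):
    """Create synthetic samples by interpolating between a point and one
    of its k nearest neighbours (a SMOTE-style sketch, one class at a time)."""
    rng = rng or np.random.RandomState(0)
    new = []
    for _ in range(n_new):
        i = rng.randint(len(X))
        d = np.linalg.norm(X - X[i], axis=1)     # distances from X[i]
        neighbours = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.rand()                         # interpolation factor in [0, 1)
        new.append(X[i] + lam * (X[j] - X[i]))
    return np.array(new)

# Three samples of one minority class; generate four synthetic ones.
X_class0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = smote_like(X_class0, n_new=4)
print(synthetic)
```

Every generated point stays inside the bounding region of the originals, which illustrates why SMOTE adds variety but no genuinely new information about the class boundary.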

Finally, you can estimate confidence from the distance to the hyperplane. See: https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline
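A sketch of that approach, assuming the same kind of toy two-class data as before: `decision_function` returns the signed value of the decision surface for each sample, and its magnitude can serve as a relative (uncalibrated) confidence score without needing `probability=True` at all:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class dataset, 5 samples per class.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 5 + [1] * 5)

clf = SVC(kernel="rbf").fit(X, y)  # no probability=True needed

# Signed decision values: sign gives the class, magnitude a relative
# confidence (larger |score| = farther from the decision surface).
scores = clf.decision_function(X)
confidence = np.abs(scores)
print(confidence.round(2))
```

Note that for an RBF kernel this is the decision value in the induced feature space, not a calibrated probability; it is only useful for ranking predictions by confidence, which is often exactly what is wanted here.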


Source: https://habr.com/ru/post/1261240/
