SVM Classification - minimum number of input data sets for each class

I am trying to create an application that detects which images on web pages are advertisements. Once an image is identified as an ad, I will block it from being displayed on the client side.

Based on the help I got in https://stackoverflow.com/a/1369532/... , I decided that SVM was the best approach for my goal.

So I implemented SVM and SMO myself. The data set I have, from the UCI data repository, has 3280 instances (link to the data set), of which about 400 belong to the class representing advertising images and the rest are non-advertising images.

Right now I take the first 2800 instances and train the SVM on them. But looking at the accuracy, I realized that most of these 2800 instances come from the non-advertising class. As a result, I get very good accuracy for that class only.

So what can I do here? Roughly how many instances should I give the SVM for training, and how many of them from each class?

Thanks. Cheers. (I am asking this as a new question because the context is different from my previous question, "Optimization of the input data of a neural network".)


Thanks for the answer. I want to check whether I am getting the C values for the ad class and the non-ad class right. Please let me know.

[image]

Or you can see the doc version here.

You can see the graph for the case where y1 equals y2 here: [image]

and for the case where y1 does not equal y2 here: [image]

+4
2 answers

There are two ways around this. One is to balance the training data so that it contains an equal number of advertising and non-advertising images. This can be done either by oversampling the 400 advertising images or by undersampling the thousands of non-advertising images. Since training time can increase dramatically with the number of data points used, you should probably first try undersampling the non-advertising images and build a training set with the 400 advertising images and 400 randomly selected non-advertising images.
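As a rough illustration of the undersampling idea, here is a minimal sketch in plain Python. The function name `undersample`, the +1/-1 label encoding, and the toy feature vectors are all made up for the example; only the 400 vs. roughly 2880 class sizes come from the question.

```python
import random

def undersample(samples, labels, seed=0):
    """Return a class-balanced subset by randomly dropping majority-class rows."""
    rng = random.Random(seed)
    # Group the feature vectors by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    # Every class is cut down to the size of the smallest class.
    n = min(len(rows) for rows in by_class.values())
    balanced = []
    for y, rows in by_class.items():
        for x in rng.sample(rows, n):
            balanced.append((x, y))
    # Shuffle so the classes are interleaved rather than stored in blocks.
    rng.shuffle(balanced)
    xs = [x for x, _ in balanced]
    ys = [y for _, y in balanced]
    return xs, ys

# Toy stand-in for the real data: 400 ads (+1) and 2880 non-ads (-1).
labels = [1] * 400 + [-1] * 2880
samples = [[float(i)] for i in range(len(labels))]

xs, ys = undersample(samples, labels)
print(ys.count(1), ys.count(-1))  # 400 400
```

Shuffling before training also avoids the problem in the question, where taking "the first 2800" rows ends up depending on how the file happens to be ordered.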

The other solution is to use a weighted SVM, so that margin errors on advertising images are penalized more heavily than those on non-ads; in the libSVM package this is done with the -wi option. From your description of the data, you could try weighting advertising images about 7 times more than non-ads.
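To show where the "about 7 times" comes from, here is a small sketch that derives per-class weights from the class counts. The helper `class_weights` and the +1/-1 label encoding are assumptions for the example; the only libSVM detail relied on is that -wi scales the C parameter of class i by the given weight.

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by (size of largest class) / (size of this class)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}

# Class sizes from the question: about 400 ads (+1), the rest non-ads (-1).
labels = [1] * 400 + [-1] * 2880
weights = class_weights(labels)

# Effective penalty per class if the base C is 1.0 (hypothetical value).
C = 1.0
per_class_C = {label: C * w for label, w in weights.items()}

# Flags in the -wi form, one per class label.
flags = " ".join(f"-w{label} {w:g}" for label, w in sorted(weights.items()))
print(weights)
print(flags)
```

With these counts the ad class gets a weight of 2880 / 400 = 7.2 and the non-ad class stays at 1, so a mistake on an ad costs roughly seven times as much as a mistake on a non-ad. The ratio is only a starting point; it is usually worth tuning it together with C on a held-out set.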

+6

The required size of your training set depends on the sparseness of the feature space. As far as I can see, you do not say which image features you have chosen to use. Before you can train, you need to convert each image into a vector of numbers (features) that describe the image, in the hope that they capture the aspects you care about.

Oh, and unless you are reimplementing SVM for sport, I would recommend just using libsvm.

+4

Source: https://habr.com/ru/post/1301621/

