Document classification using Naive Bayes Classifier

I am building a document classifier in Mahout using a simple naive Bayes algorithm. Currently, 98% of the documents I have belong to class A and only 2% belong to class B. My question: given such a wide gap between the share of class A documents and class B documents, will the classifier be able to train accurately?

What I'm considering is ignoring a whole bunch of class A documents and "manipulating" the data set I have so that there isn't such a big gap in the composition of the documents. The data set I end up with would then consist of 30% class B and 70% class A. But are there consequences of doing this that I'm not aware of?

+4
2 answers

You do not have to selectively drop class A documents to reduce their share. There are several methods for learning effectively from imbalanced data sets, such as undersampling the majority class (which is exactly what you propose), oversampling the minority class, SMOTE, etc. Here is an empirical comparison of these methods: http://machinelearning.org/proceedings/icml2007/papers/62.pdf
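For the undersampling route specifically, here is a minimal sketch in plain Python. The list names `docs_a` and `docs_b` are hypothetical; the 70/30 target comes from the question, everything else is illustrative:

```python
import random

def undersample_majority(docs_a, docs_b, majority_fraction=0.7, seed=0):
    """Randomly drop class-A documents until they make up only
    `majority_fraction` of the combined training set."""
    # Solve n_a / (n_a + n_b) = majority_fraction for n_a.
    n_b = len(docs_b)
    n_a = int(majority_fraction * n_b / (1.0 - majority_fraction))
    rng = random.Random(seed)  # fixed seed for a reproducible subsample
    return rng.sample(docs_a, min(n_a, len(docs_a))), docs_b
```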

Alternatively, you can define a custom cost matrix for the classifier. In other words, treating B as the positive class, you can set cost(False Positive) < cost(False Negative). In that case the classifier's output shifts towards the positive class. Here is a very useful tutorial: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.4418&rep=rep1&type=pdf
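One way to apply such a cost matrix without retraining is at decision time: pick the class with the lower expected cost given the model's posterior. A sketch, assuming the classifier exposes `P(B | doc)` as a probability, with purely illustrative cost values:

```python
# Illustrative costs: missing a true B is ten times worse than
# falsely flagging an A. These numbers are not from the question.
COST_FP = 1.0   # predicted B, truth was A
COST_FN = 10.0  # predicted A, truth was B

def predict_with_costs(p_b):
    """Pick the class with the lower expected cost, given p_b = P(B | doc)."""
    expected_cost_a = p_b * COST_FN          # expected cost of answering "A"
    expected_cost_b = (1.0 - p_b) * COST_FP  # expected cost of answering "B"
    return "B" if expected_cost_b < expected_cost_a else "A"
```

This is equivalent to lowering the decision threshold for B from 0.5 to COST_FP / (COST_FP + COST_FN).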

+1

A lot of this comes down to how good "accuracy" is as a measure of performance, and that depends on your problem. If misclassifying an "A" as a "B" is just as bad as misclassifying a "B" as an "A", then there is little reason to do anything other than label everything "A", since you know that will reliably get you 98% accuracy (as long as the imbalanced distribution reflects the true distribution).
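To make that baseline concrete, here is a sketch of the "label everything A" strategy on synthetic data mirroring the 98/2 split (scikit-learn rather than Mahout, purely for brevity):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# 98 class-A documents and 2 class-B documents, as in the question.
y = np.array(["A"] * 98 + ["B"] * 2)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))  # 0.98: high accuracy, yet class B is never found
```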

Without knowing your problem (and whether accuracy is the measure you should be using), the best answer I can give is "it depends on the data set." You may well be able to get 99% accuracy with standard naive Bayes, though it may be unlikely. For naive Bayes in particular, one thing you can do is disable the use of priors (essentially the prior proportion of each class). This has the effect of pretending that every class is equally likely, even though the model parameters are learned from unequal amounts of data.
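Mahout's option for this is not shown here; as an illustration, here is how the equivalent uniform-prior trick looks in scikit-learn's MultinomialNB, with a tiny made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus in which class A heavily outnumbers class B.
docs = ["invoice due", "invoice paid", "invoice overdue",
        "meeting notes", "invoice due soon", "free prize claim now"]
labels = ["A", "A", "A", "A", "A", "B"]
X = CountVectorizer().fit_transform(docs)

# class_prior fixes P(class) instead of estimating it from the data,
# so both classes are treated as equally likely a priori, while the
# word likelihoods are still learned from the unequal document counts.
clf = MultinomialNB(class_prior=[0.5, 0.5]).fit(X, labels)
print(clf.predict_proba(X[5]))  # posterior no longer dominated by A's prior
```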

Your proposed solution is common practice, and it sometimes works well. Another practice is to create synthetic data for the smaller class (how to do so depends on your data; for text documents I am not aware of a well-established way). Yet another practice is to increase the weight of the data points in the underrepresented class, as sketched below.
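A sketch of that re-weighting idea using per-sample weights (again with scikit-learn as a stand-in; the `fit_weighted` helper and the 49:1 weight are illustrative):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def fit_weighted(X, y, minority_label="B", weight=49.0):
    """Fit naive Bayes with the minority class up-weighted; a 49:1
    weight roughly balances a 98%/2% split."""
    y = np.asarray(y)
    sample_weight = np.where(y == minority_label, weight, 1.0)
    clf = MultinomialNB()
    clf.fit(X, y, sample_weight=sample_weight)
    return clf
```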

You can search for "imbalanced classification" and find much more information about these types of problem (they are among the more difficult ones).

If accuracy really is not a good measure for your problem, searching for "cost-sensitive classification" should turn up useful information.

+2

Source: https://habr.com/ru/post/1487963/

