Naive Bayes probability always 1

I started using sklearn.naive_bayes.GaussianNB for text classification and got good initial results. I want to use the probability returned by the classifier as a measure of confidence, but the predict_proba() method always returns 1.0 for the selected class and 0.0 for all the others.

I know (from here) that "the probability outputs from predict_proba are not to be taken too seriously", but to that extent?! The classifier can confuse some strings and misclassify them, yet the output of predict_proba() shows no sign of hesitation at all...

A little about the context:
- I used sklearn.feature_extraction.text.TfidfVectorizer for feature extraction, without restricting the vocabulary with stop_words or min_df / max_df for starters, so I get very large vectors (a sketch of such a restriction follows this list).
- I trained the classifier on a hierarchical tree of categories (shallow: no more than 3 levels deep), with 7 manually classified texts per category. For now the training is flat: I do not take the hierarchy into account.
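For reference, here is a minimal sketch of what restricting the vocabulary would look like; the stop_words / min_df / max_df values are arbitrary placeholders, not something I have tuned:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Illustrative settings only; the thresholds would need tuning on the real corpus.
    RestrictedVectorizer = TfidfVectorizer(
        input='content',
        stop_words='english',  # drop common function words
        min_df=2,              # ignore terms appearing in fewer than 2 documents
        max_df=0.8)            # ignore terms appearing in more than 80% of documents
    vecs = RestrictedVectorizer.fit_transform(TextsList)  # far fewer columns than before
    print(len(RestrictedVectorizer.vocabulary_))          # size of the reduced vocabulary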

The resulting GaussianNB object is very large (~300 MB), and prediction is rather slow: about 1 second for a single text.
Could this be related? Are the huge vectors at the root of all this?
How do I get meaningful predictions? Do I need to use a different classifier?

Here is the code I'm using:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import GaussianNB
    import numpy as np
    from sklearn.externals import joblib

    # Training
    Vectorizer = TfidfVectorizer(input='content')
    vecs = Vectorizer.fit_transform(TextsList)   # TextsList: ~2000 strings
    joblib.dump(Vectorizer, 'Vectorizer.pkl')

    gnb = GaussianNB()
    Y = np.array(TargetList)                     # TargetList: ~2000 categories
    gnb.fit(vecs.toarray(), Y)                   # GaussianNB needs a dense array
    joblib.dump(gnb, 'Classifier.pkl')

    ...

    # In a different function:
    Vectorizer = joblib.load('Vectorizer.pkl')
    Classifier = joblib.load('Classifier.pkl')

    InputList = [Text]                           # Text: a single string
    Vec = Vectorizer.transform(InputList)
    Probs = Classifier.predict_proba([Vec.toarray()[0]])[0]
    MaxProb = max(Probs)
    MaxProbIndex = np.where(Probs == MaxProb)[0][0]
    Category = Classifier.classes_[MaxProbIndex]
    result = (Category, MaxProb)

Update:
Following the tips below, I tried MultinomialNB and LogisticRegression. They both return varied probabilities and are better suited to my task: much more accurate classification, much smaller objects in memory and much higher speed (MultinomialNB is lightning fast!).

Now I have a new problem: the returned probabilities are very small, usually in the range 0.004-0.012, even for the predicted (winning) category (and the classification is accurate).

1 answer

"... probabilistic exits from pred_proba should not be taken too seriously"

I am the guy who wrote that. The thing is that naive Bayes tends to predict probabilities that are almost always either very close to zero or very close to one; exactly the behavior you are observing. Logistic regression (sklearn.linear_model.LogisticRegression or sklearn.linear_model.SGDClassifier(loss="log")) produces more realistic probabilities.
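A minimal sketch of that swap, reusing the vecs, Y, Vectorizer and Text names from your code (so those are assumptions about your variables, not new API):

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression()
    clf.fit(vecs, Y)  # accepts the sparse tf-idf matrix directly, no toarray() needed

    NewVec = Vectorizer.transform([Text])    # Text: one input string, as in your code
    Probs = clf.predict_proba(NewVec)[0]     # a full distribution over the classes
    Category = clf.classes_[Probs.argmax()]  # most probable class, with a usable confidence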

The resulting GaussianNB object is very large (~ 300 MB), and the prediction is rather slow: about 1 second for a single text.

This is because GaussianNB is a non-linear model and does not support sparse matrices (which you have already found out, since you are using toarray). Use MultinomialNB, BernoulliNB or LogisticRegression, which are much faster at prediction time and also smaller in memory. Their assumptions about the input are also more realistic for term features. GaussianNB is really not a good estimator for text classification.
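For example, the training part of your code could stay almost unchanged with MultinomialNB, keeping the matrix sparse throughout (again a sketch reusing your variable names, not tested against your data):

    from sklearn.naive_bayes import MultinomialNB

    mnb = MultinomialNB()
    mnb.fit(vecs, Y)  # no .toarray(): memory stays proportional to the non-zero entries
    Prediction = mnb.predict(Vectorizer.transform([Text]))[0]  # single predicted category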


Source: https://habr.com/ru/post/1495326/

