I started using sklearn.naive_bayes.GaussianNB for text classification and got good initial results. I want to use the probability returned by the classifier as a measure of confidence, but the predict_proba() method always returns "1.0" for the chosen class and "0.0" for all the others.
I know (from here ) that "...the probability outputs from predict_proba are not to be taken too seriously", but to that degree?! The classifier does sometimes misclassify texts, yet the output of predict_proba() never shows the slightest fluctuation...
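For what it's worth, the effect seems easy to reproduce on synthetic data (nothing from my real corpus). GaussianNB sums a per-feature log-likelihood over every dimension, so with thousands of features the joint likelihoods of the classes differ by hundreds of orders of magnitude, and normalizing them yields essentially 1.0 and 0.0. A minimal sketch:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
X = rng.rand(20, 5000)              # 20 samples, 5000 dense features
y = np.array([0, 1] * 10)
gnb = GaussianNB().fit(X, y)
print(gnb.predict_proba(X[:3]))     # rows typically collapse to [1., 0.] or [0., 1.]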
A little about the context:
- I used sklearn.feature_extraction.text.TfidfVectorizer for feature extraction, without, for starters, restricting the vocabulary with stop_words or min_df/max_df, so I get very large vectors (see the sketch after this list).
- I trained the classifier on a hierarchical tree of categories (shallow: no more than 3 levels deep) with 7 manually classified texts per category. For now the training is flat: I do not take the hierarchy into account.
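For reference, here is a sketch of how the vocabulary could be restricted to shrink the vectors; the thresholds are illustrative, not tuned:

from sklearn.feature_extraction.text import TfidfVectorizer

Vectorizer = TfidfVectorizer(
    input='content',
    stop_words='english',   # drop common English function words
    min_df=2,               # ignore terms appearing in fewer than 2 texts
    max_df=0.8,             # ignore terms appearing in over 80% of texts
)
vecs = Vectorizer.fit_transform(TextsList)
print(len(Vectorizer.vocabulary_))   # vocabulary size after filtering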
The resulting GaussianNB object is very large (~300 MB), and prediction is rather slow: about 1 second per text.
Could these issues be related? Are the huge vectors at the root of all this?
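My own back-of-the-envelope check suggests yes. GaussianNB stores two dense float64 arrays of shape (n_classes, n_features): the per-class feature means and variances. The ~285 categories below follow from the numbers above (2000 texts / 7 per category); the ~60,000-term vocabulary is only a guess:

n_classes = 285      # ~2000 texts / 7 per category
n_features = 60000   # assumed vocabulary size -- a guess
per_array_mb = n_classes * n_features * 8 / 1e6   # float64 = 8 bytes
print(2 * per_array_mb)   # ~274 MB for the means + variances alone

# The actual footprint can be checked on the fitted model, e.g.:
# print(gnb.theta_.nbytes)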
How do I get meaningful probability estimates? Do I need to use a different classifier?
Here is the code I'm using:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.externals import joblib  # in newer scikit-learn: import joblib

# Fit the vectorizer and the classifier, then persist both to disk
Vectorizer = TfidfVectorizer(input='content')
vecs = Vectorizer.fit_transform(TextsList)  # ~2000 strings
joblib.dump(Vectorizer, 'Vectorizer.pkl')

gnb = GaussianNB()
Y = np.array(TargetList)  # ~2000 categories
gnb.fit(vecs.toarray(), Y)  # GaussianNB needs a dense array
joblib.dump(gnb, 'Classifier.pkl')

...

# In a different function:
Vectorizer = joblib.load('Vectorizer.pkl')
Classifier = joblib.load('Classifier.pkl')

InputList = [Text]  # one string
Vec = Vectorizer.transform(InputList)
Probs = Classifier.predict_proba(Vec.toarray())[0]  # densify for GaussianNB
MaxProbIndex = np.argmax(Probs)
MaxProb = Probs[MaxProbIndex]
Category = Classifier.classes_[MaxProbIndex]
result = (Category, MaxProb)
Update:
Following the tips below, I tried MultinomialNB and LogisticRegression. Both return varied probabilities and are better suited to my task: the classification is much more accurate, the objects are smaller in memory, and prediction is much faster (MultinomialNB is lightning fast!).
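For completeness, a minimal sketch of the MultinomialNB variant I switched to. Unlike GaussianNB, it accepts the sparse matrix directly, so there is no toarray() blow-up:

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(vecs, Y)            # vecs can stay a scipy.sparse matrix
Probs = mnb.predict_proba(Vectorizer.transform([Text]))[0]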
Now I have a new problem: the returned probabilities are very small, typically in the range 0.004-0.012. These are the probabilities for the predicted (winning) category, and the classification itself is accurate.
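Continuing the sketch above, one way to calibrate intuition: with roughly 2000 / 7 ≈ 285 categories, a uniform guess would give about 1/285 ≈ 0.0035 per class, so 0.004-0.012 is actually above chance. Comparing the top probability to the chance level, or to the runner-up class, might make a more usable confidence score:

import numpy as np

probs = mnb.predict_proba(Vectorizer.transform([Text]))[0]
n_classes = len(mnb.classes_)
chance = 1.0 / n_classes                 # ~0.0035 for ~285 classes
top, runner_up = np.sort(probs)[-2:][::-1]
lift = top / chance                      # how many times above chance
margin = top - runner_up                 # gap to the second-best class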