Calling transform on a LatentDirichletAllocation model returns an unnormalized document-topic distribution. To get proper probabilities, you can simply normalize each row of the result. Here is an example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
import numpy as np

# Load a small train/test split from the 20 newsgroups corpus
dataset = fetch_20newsgroups(shuffle=True, remove=('headers', 'footers', 'quotes'))
train, test = dataset.data[:100], dataset.data[100:200]

# Vectorize the documents and fit the LDA model
tf_vectorizer = TfidfVectorizer(max_features=25)
X_train = tf_vectorizer.fit_transform(train)
lda = LatentDirichletAllocation(n_components=5)  # n_topics was renamed to n_components
lda.fit(X_train)

# Transform the test documents and normalize each row so it sums to 1
X_test = tf_vectorizer.transform(test)
doc_topic_dist_unnormalized = lda.transform(X_test)
doc_topic_dist = doc_topic_dist_unnormalized / doc_topic_dist_unnormalized.sum(axis=1, keepdims=True)
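As a quick sanity check (continuing from the snippet above), every row of doc_topic_dist should now sum to 1:

print(doc_topic_dist.shape)                        # (100, 5): one row per test document
print(np.allclose(doc_topic_dist.sum(axis=1), 1))  # True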
To find the top-ranking topic for each document, you can do something like:
doc_topic_dist.argmax(axis=1)
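If you also want to see which terms dominate each topic, you can look at the fitted topic-word weights in lda.components_. A minimal sketch, assuming scikit-learn 1.0+ where get_feature_names_out is available (on older versions, get_feature_names plays the same role):

# Print the 5 highest-weighted terms for each of the 5 topics
feature_names = tf_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_terms = [feature_names[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")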