How to get the topics associated with each document using pyspark (2.1.0) LDA?

I am using pyspark's LDA model to extract topics from a corpus. My goal is to find the topics associated with each document. To do this, I tried setting topicDistributionCol as described in the docs. Since I'm new to this, I'm not sure what the purpose of this column is.

from pyspark.ml.clustering import LDA

lda = LDA(k=10, optimizer="em").setTopicDistributionCol("topicDistributionCol")
# documents is a valid dataset for this LDA model
lda_model = lda.fit(documents)
transformed = lda_model.transform(documents)

topics = lda_model.describeTopics(maxTermsPerTopic=num_words_per_topic)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

It lists all topics with termIndices and termWeights.


The code below prints topicDistributionCol, with one row per document:

transformed.select("topicDistributionCol").show(truncate=False)


I want to get a document-topic mapping like the one below. Is this possible with pyspark's LDA model?

doc | topic 
1   |  [2,4]
2   |  [3,4,6]
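One way to build such a mapping from pyspark's output is to threshold each document's topic-distribution vector. A minimal sketch in plain Python (assuming the distribution has been collected as an ordinary list of weights; the 0.1 cutoff is a hypothetical choice you would tune for your corpus):

```python
def top_topics(dist, threshold=0.1):
    """Indices of topics whose weight is at least `threshold`, strongest first."""
    pairs = [(i, w) for i, w in enumerate(dist) if w >= threshold]
    return [i for i, _ in sorted(pairs, key=lambda p: -p[1])]

# A made-up 10-topic distribution for one document
dist = [0.02, 0.05, 0.40, 0.01, 0.30, 0.02, 0.12, 0.03, 0.03, 0.02]
print(top_topics(dist))  # [2, 4, 6]
```

In Spark itself, this function can be wrapped in a `pyspark.sql.functions.udf` returning `ArrayType(IntegerType())` and applied to the topicDistributionCol column (the column holds Spark vectors, so convert each with `.toArray().tolist()` first).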

Note: I did this earlier with gensim's LdaModel using the code below, but I need to do it with pyspark's LDA model.

from gensim import corpora
from gensim.models import LdaModel

texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]
doc_topics = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10)

## to fetch the topics for one document
vec_bow = dictionary.doc2bow(texts[0])
topics = doc_topics[vec_bow]
topic_list = [x[0] for x in topics]
## topic_list is [1, 5]
Answer:

Try:

transformed.take(10)

and look at the "topicDistribution" field of the returned rows.
Source: https://habr.com/ru/post/1668395/
