Latent Dirichlet Allocation (LDA) in Spark

I am trying to write a program in Spark that performs Latent Dirichlet Allocation (LDA). The Spark reference documentation page gives a good example of running LDA on sample data. Below is the program:

    from pyspark.mllib.clustering import LDA, LDAModel
    from pyspark.mllib.linalg import Vectors

    # Load and parse the data
    data = sc.textFile("data/mllib/sample_lda_data.txt")
    parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
    # Index documents with unique IDs
    corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

    # Cluster the documents into three topics using LDA
    ldaModel = LDA.train(corpus, k=3)

    # Output topics. Each is a distribution over words (matching word count vectors)
    print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
    topics = ldaModel.topicsMatrix()
    for topic in range(3):
        print("Topic " + str(topic) + ":")
        for word in range(0, ldaModel.vocabSize()):
            print(" " + str(topics[word][topic]))

    # Save and load model
    ldaModel.save(sc, "target/org/apache/spark/PythonLatentDirichletAllocationExample/LDAModel")
    sameModel = LDAModel.load(sc, "target/org/apache/spark/PythonLatentDirichletAllocationExample/LDAModel")

The sample data file (sample_lda_data.txt) used is shown below:

    1 2 6 0 2 3 1 1 0 0 3
    1 3 0 1 3 0 0 2 0 0 1
    1 4 1 0 0 4 9 0 1 2 0
    2 1 0 3 0 0 5 0 2 3 9
    3 1 1 9 3 0 2 0 0 1 3
    4 2 0 3 4 5 1 1 1 4 0
    2 1 0 3 0 0 5 0 2 2 9
    1 1 1 9 2 1 2 0 0 1 3
    4 4 0 3 4 2 1 3 0 0 0
    2 8 2 0 3 0 2 0 2 7 2
    1 1 1 9 0 2 2 0 0 3 3
    4 1 0 0 4 5 1 3 0 1 0
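Each row of this file is one document and each column is the count of one vocabulary word. A quick way to confirm that the parsed corpus has the [documentId, wordCountVector] shape that LDA.train expects is to inspect a couple of entries. This is only a small sketch that reuses the parsing code from the program above and assumes the SparkContext `sc` and the data file are available:

    # Sketch: re-run the parsing from the program above and look at the corpus entries.
    from pyspark.mllib.linalg import Vectors

    data = sc.textFile("data/mllib/sample_lda_data.txt")
    parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
    corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

    # Each element is [documentId, DenseVector of word counts], which is the input format LDA.train needs
    for doc_id, vec in corpus.take(2):
        print(doc_id, vec)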

How do I change the program to run on a data file containing actual text instead of numbers? Let the sample file contain the following text.

Latent Dirichlet Allocation (LDA) is a topic model which infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm as follows:

Topics correspond to cluster centers, and documents correspond to examples (rows) in a data set. Topics and documents both exist in a feature space, where the feature vectors are vectors of word counts (bag of words). Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
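To illustrate the "bag of words" representation mentioned above, here is a minimal sketch of how pyspark.ml's CountVectorizer turns tokenized text into word-count vectors. The two toy sentences are made up for illustration, and it assumes a SparkSession named `spark` is available, as in the answer below:

    # Sketch: toy bag-of-words example with CountVectorizer (illustrative sentences only).
    from pyspark.sql import Row
    from pyspark.ml.feature import CountVectorizer

    toyDF = spark.createDataFrame([
        Row(idd=0, words="lda is a topic model for text documents".split(" ")),
        Row(idd=1, words="topics correspond to cluster centers".split(" ")),
    ])

    cv = CountVectorizer(inputCol="words", outputCol="vectors")
    cvModel = cv.fit(toyDF)
    cvModel.transform(toyDF).show(truncate=False)  # each row gains a sparse vector of word counts
    print(cvModel.vocabulary)                      # index -> word mapping used by those vectors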

1 answer

After doing some research, I am attempting to answer this question. Below is sample code for running LDA on a text document with real text data using Spark.

    from pyspark.sql import Row
    from pyspark.ml.feature import CountVectorizer
    from pyspark.mllib.clustering import LDA
    from pyspark.mllib.linalg import Vectors

    path = "sample_text_LDA.txt"

    # Read the text file, index each line (document) and tokenize it into words
    data = sc.textFile(path).zipWithIndex().map(lambda x: Row(idd=x[1], words=x[0].split(" ")))
    docDF = spark.createDataFrame(data)

    # Build bag-of-words count vectors for every document
    cv = CountVectorizer(inputCol="words", outputCol="vectors")
    cvModel = cv.fit(docDF)
    result = cvModel.transform(docDF)

    # Convert to the (documentId, wordCountVector) RDD format expected by mllib's LDA
    corpus = result.select("idd", "vectors").rdd.map(lambda row: [row[0], Vectors.fromML(row[1])]).cache()

    # Cluster the documents into three topics using LDA
    ldaModel = LDA.train(corpus, k=3, maxIterations=100, optimizer='online')
    topics = ldaModel.topicsMatrix()
    vocabArray = cvModel.vocabulary

    wordNumbers = 10  # number of words per topic
    topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic=wordNumbers))

    # Map term indices back to the actual words
    def topic_render(topic):
        terms = topic[0]
        result = []
        for i in range(wordNumbers):
            term = vocabArray[terms[i]]
            result.append(term)
        return result

    topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()

    for topic in range(len(topics_final)):
        print("Topic " + str(topic) + ":")
        for term in topics_final[topic]:
            print(term)
        print('\n')
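For this to run end-to-end, sample_text_LDA.txt has to exist with one document per line. As a minimal sketch (the split of the question's sample text into separate documents is my own choice, and the file name simply matches the `path` variable above), the file can be created like this on the local filesystem, which is sufficient for a local Spark run:

    # Write the question's sample text to sample_text_LDA.txt, one document per line.
    # The document split below is an assumption made for illustration.
    docs = [
        "Latent Dirichlet Allocation (LDA) is a topic model which infers topics from a collection of text documents.",
        "Topics correspond to cluster centers, and documents correspond to examples (rows) in a data set.",
        "Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.",
    ]
    with open("sample_text_LDA.txt", "w") as f:
        f.write("\n".join(docs))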

The topics extracted from the text data given in the question are shown below:

[screenshot of the extracted topics]


Source: https://habr.com/ru/post/1263833/

