I am trying to write a Spark program that performs Latent Dirichlet Allocation (LDA). The Spark reference documentation has a good example of running LDA on sample data. The program is below:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

# Load and parse the data
data = sc.textFile("data/mllib/sample_lda_data.txt")
parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
# Index documents with unique IDs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)

# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize())
      + " words):")
topics = ldaModel.topicsMatrix()
for topic in range(3):
    print("Topic " + str(topic) + ":")
    for word in range(0, ldaModel.vocabSize()):
        print(" " + str(topics[word][topic]))

# Save and load model
ldaModel.save(sc, "target/org/apache/spark/PythonLatentDirichletAllocationExample/LDAModel")
sameModel = LDAModel\
    .load(sc, "target/org/apache/spark/PythonLatentDirichletAllocationExample/LDAModel")
The sample data (sample_lda_data.txt) is shown below:
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
How do I change the program so that it runs on a file containing text instead of numbers? Suppose the sample file contains the following text:
Latent Dirichlet Allocation (LDA) is a topic model that infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm as follows:
Topics correspond to cluster centers, and documents correspond to examples (rows) in a data set. Topics and documents both exist in a feature space, where the feature vectors are vectors of word counts (bag of words). Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
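To make the bag-of-words idea concrete: the numeric sample file already holds one word-count vector per document, so a text file would first need to be turned into vectors of the same shape. Below is a minimal plain-Python sketch of that step. The helper names (tokenize, build_vocabulary, to_count_vector), the two example documents, and the choice to treat each line as one document with simple lowercase alphabetic tokens are all assumptions made for illustration, not part of the Spark example.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(tokenized_docs):
    """Assign each distinct word a column index, in alphabetical order."""
    words = sorted({word for doc in tokenized_docs for word in doc})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(tokens, vocab):
    """Build a dense word-count vector with one entry per vocabulary word."""
    vector = [0.0] * len(vocab)
    for token in tokens:
        if token in vocab:  # words outside the vocabulary are ignored
            vector[vocab[token]] += 1.0
    return vector

# Hypothetical input: each string stands for one line (document) of the text file.
docs = [
    "Topics correspond to cluster centers",
    "Documents correspond to rows in a data set",
]

tokenized = [tokenize(doc) for doc in docs]
vocabulary = build_vocabulary(tokenized)
vectors = [to_count_vector(tokens, vocabulary) for tokens in tokenized]

for doc_id, vector in enumerate(vectors):
    print(doc_id, vector)
```

Each resulting list has the same role as one row of sample_lda_data.txt, so it could presumably be wrapped with Vectors.dense and paired with a document ID to form the corpus passed to LDA.train. On a real cluster the tokenizing and counting would more likely be done with RDD operations or with pyspark.ml.feature.CountVectorizer rather than in driver-side Python.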