Interpreting the results of LDA in Spark MLlib

Question

I ran LDA in Spark on a set of documents and noticed that the values in topicsMatrix, which represents the distribution of each topic over terms, are greater than 1, for example 548.2201, 685.2436, 138.4013 ... What do these values mean? Are they log-probabilities or something similar? How can I convert them to probability distributions? Thanks in advance.

+4

apache-spark apache-spark-mllib lda

hari Oct 25 '15 at 14:17

2 answers

Code that normalizes the Matrix, in Scala:

import org.apache.spark.mllib.linalg.Matrix

def formatSparkLDAWordOutput(wordTopMat: Matrix, wordMap: Map[Int, String]): scala.Predef.Map[String, Array[Double]] = {
  // The incoming word-topic matrix is in column-major order and its columns are unnormalized.
  val m = wordTopMat.numRows // number of words
  val n = wordTopMat.numCols // number of topics

  // Sum of each column, i.e. the total weight of each topic.
  val columnSums: Array[Double] = Range(0, n).map(j => (Range(0, m).map(i => wordTopMat(i, j)).sum)).toArray

  // Transposing puts each word's n topic weights next to each other; divide each by its topic's sum.
  val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq
    .map(unnormProbs => unnormProbs.zipWithIndex.map({ case (u, j) => u / columnSums(j) }))

  // Key each word's normalized topic weights by the word itself.
  wordProbs.zipWithIndex.map({ case (topicProbs, wordInd) => (wordMap(wordInd), topicProbs) }).toMap
}

https://github.com/apache/incubator-spot/blob/v1.0-incubating/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAWrapper.scala#L237

0

okwap 18 . '17 9:14

Jason lenderman · Accepted Answer · 2015-10-25T18:18:42+0000

For both models (i.e., DistributedLDAModel and LocalLDAModel), the topicsMatrix method will, I believe, return the expected word-topic count matrix (approximately, since there is a little regularization due to the Dirichlet prior on the topics). The matrix has one row per vocabulary term and one column per topic. To check this, you can take the matrix and add all of its columns together. The resulting vector (whose length is the vocabulary size) should be approximately equal to the word counts over all your documents. In any case, to get the topics (probability distributions over the words in the vocabulary), you need to normalize the columns of the matrix returned by topicsMatrix so that each one sums to 1.

I have not tested it fully, but something like this should work to normalize the columns of the matrix returned by topicsMatrix:

import breeze.linalg.{DenseMatrix => BDM, sum}
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}

def normalizeColumns(m: Matrix): DenseMatrix = {
  // Build the Breeze matrix from a copy of the column-major values
  // (the Spark-to-Breeze converters are not public API).
  val bm = new BDM[Double](m.numRows, m.numCols, m.toArray.clone())
  var j = 0
  while (j < bm.cols) {
    // Divide each column (topic) by its sum so that it sums to 1.
    bm(::, j) /= sum(bm(::, j))
    j += 1
  }
  new DenseMatrix(bm.rows, bm.cols, bm.data)
}
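
To make the arithmetic concrete without needing Spark or Breeze, here is a minimal plain-Scala sketch of the same idea. The object name, the tiny 3-term by 2-topic matrix, and its values are all made up for illustration; the array simply stands in for what model.topicsMatrix.toArray would give you (column-major, one column per topic). It assumes no column sums to zero.

```scala
object ColumnNormalizationSketch {
  // Hypothetical matrix: 3 vocabulary terms (rows) x 2 topics (columns),
  // stored column-major like topicsMatrix.toArray. Values are invented.
  val numRows = 3
  val numCols = 2
  val values: Array[Double] = Array(
    6.0, 3.0, 1.0, // topic 0: pseudo-counts for terms 0, 1, 2
    2.0, 2.0, 4.0  // topic 1: pseudo-counts for terms 0, 1, 2
  )

  // Divide every entry by the sum of its column, so each topic sums to 1.
  def normalizeColumns(values: Array[Double], numRows: Int, numCols: Int): Array[Double] = {
    val out = values.clone()
    for (j <- 0 until numCols) {
      val start = j * numRows
      val colSum = (start until start + numRows).map(i => values(i)).sum
      for (i <- start until start + numRows) out(i) = values(i) / colSum
    }
    out
  }

  // Normalized topics: entries j * numRows until (j + 1) * numRows are P(term | topic j).
  val probs: Array[Double] = normalizeColumns(values, numRows, numCols)

  // Adding the unnormalized columns together gives one total per term,
  // which is the vector that should roughly match the corpus word counts.
  val termTotals: Seq[Double] =
    (0 until numRows).map(i => (0 until numCols).map(j => values(j * numRows + i)).sum)
}
```

With real output you would pass model.topicsMatrix.toArray, numRows and numCols instead of the toy values, and compare termTotals against the word counts you computed when building the corpus.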
