Interpreting the results of Spark MLlib LDA

I ran LDA in Spark on a set of documents and noticed that the topicsMatrix values, which represent the distribution of topics over terms, are greater than 1, for example 548.2201, 685.2436, 138.4013, ... What do these values mean? Are they log-scaled distribution values or something else? How do I convert these values to probability-distribution values? Thanks in advance.

2 answers

In both models (i.e., DistributedLDAModel and LocalLDAModel) the topicsMatrix method will, I believe, return (approximately; there is a bit of regularization due to the Dirichlet prior on topics) the expected word-count matrix. To check this, you can take that matrix and sum all of its columns. The resulting vector (whose length equals the number of topics) should be approximately equal to the total word count across all your documents. In any case, to obtain the topics (probability distributions over the words in the vocabulary), you need to normalize the columns of the matrix returned by topicsMatrix so that each column sums to 1.
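To make both the sanity check and the normalization concrete, here is a self-contained sketch in plain Scala on a toy column-major word-count matrix (the numbers are made up and no Spark is required; it only mirrors the arithmetic described above):

```scala
object TopicsMatrixSketch {
  // Toy "topicsMatrix": 3 words (rows) x 2 topics (columns), stored
  // column-major, holding expected word counts rather than probabilities.
  val numRows = 3
  val numCols = 2
  val counts: Array[Double] = Array(
    548.22, 685.24, 138.40, // column 0 (topic 0)
    100.0, 300.0, 600.0     // column 1 (topic 1)
  )

  // Sanity check: sum each column. In the real model, these sums taken
  // together should be roughly comparable to the total word count over
  // all documents.
  def columnSums(m: Array[Double], rows: Int, cols: Int): Array[Double] =
    Array.tabulate(cols)(j => (0 until rows).map(i => m(j * rows + i)).sum)

  // Normalize each column so it sums to 1, turning expected counts into
  // per-topic probability distributions over the vocabulary.
  def normalizeColumns(m: Array[Double], rows: Int, cols: Int): Array[Double] = {
    val sums = columnSums(m, rows, cols)
    Array.tabulate(rows * cols)(k => m(k) / sums(k / rows))
  }
}
```

After normalization, every column is a valid probability distribution: each entry is a word's probability under that topic, and each column sums to 1.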

I have not tested this thoroughly, but something like the following should work to normalize the columns of the matrix returned by topicsMatrix:

import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg._

def normalizeColumns(m: Matrix): DenseMatrix = {
  // Convert to a mutable Breeze dense matrix.
  val bm = Matrices.toBreeze(m).toDenseMatrix
  // Accumulate the sum of each column into a row vector.
  val columnSums = BDV.zeros[Double](bm.cols).t
  var i = bm.rows
  while (i > 0) { i -= 1; columnSums += bm(i, ::) }
  // Divide each column by its sum so that every column sums to 1.
  i = bm.cols
  while (i > 0) { i -= 1; bm(::, i) /= columnSums(i) }
  new DenseMatrix(bm.rows, bm.cols, bm.data)
}

Here is another way to normalize the columns of a Matrix in plain Scala, taken from the Apache Spot project:

def formatSparkLDAWordOutput(wordTopMat: Matrix, wordMap: Map[Int, String]): scala.Predef.Map[String, Array[Double]] = {

  // The incoming word-topic matrix is in column-major order and its columns are unnormalized.
  val m = wordTopMat.numRows
  val n = wordTopMat.numCols
  val columnSums: Array[Double] =
    Range(0, n).map(j => Range(0, m).map(i => wordTopMat(i, j)).sum).toArray

  // Transpose so the underlying array lists the matrix row by row, then
  // group by n to get each word's (unnormalized) topic weights.
  val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq
    .map(unnormProbs => unnormProbs.zipWithIndex.map({ case (u, j) => u / columnSums(j) }))

  // Key each word's topic-probability vector by the word itself.
  wordProbs.zipWithIndex.map({ case (topicProbs, wordInd) => (wordMap(wordInd), topicProbs) }).toMap
}

https://github.com/apache/incubator-spot/blob/v1.0-incubating/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAWrapper.scala#L237
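The transpose-then-group step is the subtle part of the code above: transposing a column-major matrix produces an array that lists the original matrix row by row, so grouping it by the number of columns yields one array per word. A plain-Scala sketch of that reshaping (toy data and a hypothetical helper, no Spark required):

```scala
object GroupedRowsSketch {
  // A 2x3 matrix stored column-major: rows are words, columns are topics.
  //   word 0: (1, 3, 5)
  //   word 1: (2, 4, 6)
  val rows = 2
  val cols = 3
  val colMajor: Array[Double] = Array(1, 2, 3, 4, 5, 6)

  // Transposing a column-major (rows x cols) matrix gives a column-major
  // (cols x rows) matrix whose backing array lists the original rows in order.
  def transposeColMajor(m: Array[Double], rows: Int, cols: Int): Array[Double] =
    Array.tabulate(rows * cols)(k => m((k % cols) * rows + k / cols))

  // Grouping that array by `cols` recovers one row (one word) at a time,
  // exactly as wordTopMat.transpose.toArray.grouped(n) does in Spark.
  val wordRows: Seq[Array[Double]] =
    transposeColMajor(colMajor, rows, cols).grouped(cols).toSeq
}
```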


Source: https://habr.com/ru/post/1613101/

