In both models (i.e., DistributedLDAModel and LocalLDAModel), the topicsMatrix method will, I believe, return the expected word-topic count matrix (approximately; there is a little regularization due to the Dirichlet prior on topics). The matrix has one row per vocabulary term and one column per topic. To check this, you can take the matrix and sum its columns together, i.e., sum across topics. The resulting vector (one entry per vocabulary term) should be approximately equal to the word counts over all your documents. In any case, to get the topics (probability distributions over the words in the dictionary), you need to normalize the columns of the matrix returned by topicsMatrix so that each one sums to 1.
I have not tested it completely, but something like this should work to normalize the columns of the matrix returned by topicsMatrix:
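import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}

// Returns a copy of m in which every column has been divided by its sum.
// Uses only the public Matrix API; toArray yields the values in column-major order.
def normalizeColumns(m: Matrix): DenseMatrix = {
  val rows = m.numRows
  val cols = m.numCols
  val data = m.toArray.clone() // copy, so the input matrix is never modified
  var j = 0
  while (j < cols) {
    val start = j * rows // offset of column j in the column-major array
    var sum = 0.0
    var i = 0
    while (i < rows) { sum += data(start + i); i += 1 }
    i = 0
    while (i < rows) { data(start + i) /= sum; i += 1 }
    j += 1
  }
  new DenseMatrix(rows, cols, data)
}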
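
To see what the two steps do without a Spark cluster, here is a minimal sketch in plain Scala. It assumes a made-up 3-term by 2-topic count matrix stored as a column-major array (the layout that Matrix.toArray returns). The object TopicsMatrixCheck, the helpers termTotals and normalizeColumnMajor, and the numbers are all hypothetical and only for illustration.

```scala
// Toy stand-in for topicsMatrix: 3 vocabulary terms x 2 topics, column-major.
// All values are invented for illustration; a real matrix comes from the LDA model.
object TopicsMatrixCheck {
  val numTerms = 3
  val numTopics = 2
  val counts: Array[Double] = Array(
    4.0, 2.0, 2.0, // topic 0: expected counts for terms 0, 1, 2
    1.0, 1.0, 6.0  // topic 1: expected counts for terms 0, 1, 2
  )

  // Sum across topics: one total per vocabulary term.
  // These totals are what you would compare with the corpus word counts.
  def termTotals(data: Array[Double], rows: Int, cols: Int): Array[Double] =
    Array.tabulate(rows)(i => (0 until cols).map(j => data(j * rows + i)).sum)

  // Divide every entry by the sum of its column, so each topic sums to 1.
  def normalizeColumnMajor(data: Array[Double], rows: Int, cols: Int): Array[Double] = {
    val colSums = Array.tabulate(cols)(j => (0 until rows).map(i => data(j * rows + i)).sum)
    Array.tabulate(rows * cols)(k => data(k) / colSums(k / rows))
  }

  def main(args: Array[String]): Unit = {
    // Per-term totals across both topics.
    println(termTotals(counts, numTerms, numTopics).mkString(", "))
    // Each normalized column (topic) should sum to 1.
    val topics = normalizeColumnMajor(counts, numTerms, numTopics)
    println(topics.grouped(numTerms).map(_.sum).mkString(", "))
  }
}
```

With a real model you would do the same comparison between the per-term totals and word counts computed from your own corpus. Expect approximate agreement only, because of the regularization mentioned above.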