I am trying to get started with LDA using the topicmodels package in R. The example in the manual uses data from the Associated Press and works great. However, when I use my own data, I get topics whose terms are document names. I traced the problem to the fact that my term-document matrix is transposed relative to the example's (rows <-> columns).
TDM Example:
str(AssociatedPress)
List of 6
 $ i       : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
 $ v       : int [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
 $ nrow    : int 2246
 $ ncol    : int 10473
 $ dimnames:List of 2
  ..$ Docs : NULL
  ..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
My TDM, by contrast, has Terms as rows and Docs as columns:
List of 6
 $ i       : int [1:10489] 1 3 4 13 20 24 25 26 27 28 ...
 $ j       : int [1:10489] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:10489] 1 1 1 1 2 1 67 1 44 3 ...
 $ nrow    : int 5903
 $ ncol    : int 9
 $ dimnames:List of 2
  ..$ Terms: chr [1:5903] "\u2439aa" "aars" "\u2439ab" "\u242dab" ...
  ..$ Docs : chr [1:9] "art111130.txt" "art111131.txt" "art111132.txt" "art111133.txt" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
This causes LDA(art_tdm, 3) to build topics from document names rather than from terms. Is this due to a change in the code base of the tm package? I cannot see what in my code would cause this transposition:
art_cor <- Corpus(DirSource(directory = "tmptxts"))
art_tdm <- TermDocumentMatrix(art_cor)
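For comparison, here is a minimal sketch on a toy corpus. The three sentences and the names toy_cor, toy_tdm, toy_dtm and toy_dtm2 are made up for illustration and are not my real data. It assumes that tm's DocumentTermMatrix() constructor and its t() method for term-document matrices behave the way I understand them to, which would suggest that the orientation depends on which constructor is called rather than on a change in tm:

```r
library(tm)

# Toy corpus: three made-up documents (hypothetical, not the real data)
toy_cor <- Corpus(VectorSource(c(
  "topic models describe documents",
  "documents contain many terms",
  "terms define each topic"
)))

# Terms as rows, documents as columns (same layout as art_tdm)
toy_tdm <- TermDocumentMatrix(toy_cor)

# Documents as rows, terms as columns (same layout as AssociatedPress)
toy_dtm <- DocumentTermMatrix(toy_cor)

# Transposing the term-document matrix should give the other layout
toy_dtm2 <- t(toy_tdm)

# Compare classes and dimensions of the three objects
print(class(toy_tdm))
print(class(toy_dtm))
print(class(toy_dtm2))
print(dim(toy_tdm))
print(dim(toy_dtm))
```

If that holds, something like LDA(DocumentTermMatrix(art_cor), 3) or LDA(t(art_tdm), 3) might match what the topicmodels example expects, though I have not confirmed this on my data.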
Any help would be greatly appreciated.