LDA with topicmodels listing numbers, not terms

Bear with me, as I am extremely new to this and working on a project for a course in a certification program.

I have a .csv dataset that I obtained by retrieving bibliometric records from the PubMed and Embase databases. It has 1034 rows and several columns, but I am trying to build topic models from just one column, the Abstract column, and some records do not have an abstract. I did some preprocessing (removing stop words, punctuation, etc.) and was able to find the words that occur more than 200 times, build a ranked list of the most frequent terms, and find associations of other words with selected words. So it seems that R does see the actual words in the Abstract field. My problem arises when I try to build topic models with the topicmodels package. Here is a bit of the code I am using.

#including the first few lines for reference
library(tm)
options(header = FALSE, stringsAsFactors = FALSE, FileEncoding = "latin1")
records <- read.csv("Combined.csv")

AbstractCorpus <- Corpus(VectorSource(records$Abstract))
AbstractTDM <- TermDocumentMatrix(AbstractCorpus)

library(topicmodels)
library(lda)
lda <- LDA(AbstractTDM, k = 8)
(term <- terms(lda, 6))
term <- apply(term, MARGIN = 2, paste, collapse = ",")
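For context, the cleanup I mentioned above happened between building the corpus and building the term-document matrix, roughly along the lines of the sketch below. The tm transformation functions are the standard ones, but the exact calls, the 200-occurrence threshold, and the example term "patient" are reconstructed here just to illustrate what I did.

#hypothetical reconstruction of the cleanup steps described above
library(tm)

AbstractCorpus <- tm_map(AbstractCorpus, content_transformer(tolower))
AbstractCorpus <- tm_map(AbstractCorpus, removePunctuation)
AbstractCorpus <- tm_map(AbstractCorpus, removeWords, stopwords("english"))
AbstractCorpus <- tm_map(AbstractCorpus, stripWhitespace)

AbstractTDM <- TermDocumentMatrix(AbstractCorpus)

#terms occurring more than 200 times across all abstracts
findFreqTerms(AbstractTDM, lowfreq = 200)

#associations of other words with a selected word ("patient" is only an example)
findAssocs(AbstractTDM, "patient", 0.3)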

However, the output I get from terms(lda, 6) is as follows.

     Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
[1,] "499"   "733"   "390"   "833"   "17"    "413"   "719"   "392"
[2,] "484"   "655"   "808"   "412"   "550"   "881"   "721"   "61"
[3,] "857"   "299"   "878"   "909"   "15"    "258"   "47"    "164"
[4,] "491"   "672"   "313"   "1028"  "126"   "55"    "375"   "987"
[5,] "734"   "430"   "405"   "102"   "13"    "193"   "83"    "588"
[6,] "403"   "52"    "489"   "10"    "598"   "52"    "933"   "980"

Why am I getting numbers here instead of words?

In addition, the following code, which I mostly took from the topicmodels package paper (PDF), does run and produce output for me, but the topics are still numbers rather than words, which makes no sense to me.

#using code adapted from the topicmodels paper
library(tm)
library(topicmodels)
library(lda)

AbstractTM <- list(
  VEM       = LDA(AbstractTDM, k = 10, control = list(seed = 505)),
  VEM_fixed = LDA(AbstractTDM, k = 10,
                  control = list(estimate.alpha = FALSE, seed = 505)),
  Gibbs     = LDA(AbstractTDM, k = 10, method = "Gibbs",
                  control = list(seed = 505, burnin = 100, thin = 10, iter = 100)),
  CTM       = CTM(AbstractTDM, k = 10,
                  control = list(seed = 505, var = list(tol = 10^-4),
                                 em = list(tol = 10^-3)))
)

#To compare the fitted models, first investigate the α values of the models
#fitted with VEM and α estimated vs. VEM and α fixed
sapply(AbstractTM[1:2], slot, "alpha")

#entropy of the posterior topic distributions
sapply(AbstractTM, function(x)
  mean(apply(posterior(x)$topics, 1, function(z) -sum(z * log(z)))))

#most likely topic for each document
Topic <- topics(AbstractTM[["VEM"]], 1)
Topic

#five most frequent terms for each topic
Terms <- terms(AbstractTM[["VEM"]], 5)
Terms[, 1:5]

Any thoughts on what could be the problem?

1 answer

Looking at the topicmodels documentation, it seems that the LDA() function expects a DocumentTermMatrix, not a TermDocumentMatrix. Try creating the former with DocumentTermMatrix(AbstractCorpus) and see if that works.
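A minimal sketch of that change, reusing the objects from the question (the row-filtering step is an extra assumption on my part, to drop the records with missing abstracts, since LDA() cannot fit documents with no terms):

library(tm)
library(topicmodels)

#documents as rows, terms as columns -- the orientation LDA() expects
AbstractDTM <- DocumentTermMatrix(AbstractCorpus)

#assumption: drop documents that ended up with no terms (e.g. empty abstracts),
#otherwise LDA() will complain about rows that sum to zero
AbstractDTM <- AbstractDTM[slam::row_sums(AbstractDTM) > 0, ]

lda <- LDA(AbstractDTM, k = 8)
terms(lda, 6)  #should now return actual terms rather than numeric labels

If that is indeed the issue, the numbers also make sense: with the matrix transposed, what LDA() treats as term labels are the document IDs, which fits the values running up toward your 1034 records.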



