Stem completion in R replaces names, not values

My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the quanteda package in R. I would like to reduce words to their stems before the topic modeling step, so that variations on the same word are not counted as different topics.

The only problem is that the stemming algorithm leaves behind some stems that are not really words. "Happiness" stems to "happi," "arrange" stems to "arrang," and so on. So, before visualizing the results of the topic modeling, I would like to restore the stems to complete words.

Looking through some previous threads here on StackOverflow, I came across a function, stemCompletion(), from the tm package, which does this, at least approximately. It seems to work reasonably well.

But when I apply it to the terms vector of a document-term matrix, stemCompletion() always replaces the names of the character vector, not the character values themselves. Here's a reproducible example:

    # Set up libraries
    library(janeaustenr)
    library(quanteda)
    library(tm)

    # Get first 200 words of Mansfield Park
    words <- head(mansfieldpark, 200)

    # Build a corpus from words
    corpus <- quanteda::corpus(words)

    # Eliminate some words from counting process
    STOPWORDS <- c("the", "and", "a", "an")

    # Create a document text matrix and do topic modeling
    dtm <- corpus %>%
        quanteda::dfm(remove_punct = TRUE,
                      remove = STOPWORDS) %>%
        quanteda::dfm_wordstem(.) %>%  # Word stemming takes place here
        quanteda::convert("topicmodels")

    # Word stems are now stored in dtm$dimnames$Terms

    # View a sample of stemmed terms
    tail(dtm$dimnames$Terms, 20)

    # View the structure of dtm$dimnames$Terms (it's just a character vector)
    str(dtm$dimnames$Terms)

    # Apply tm::stemCompletion to Terms
    unstemmed_terms <-
        tm::stemCompletion(dtm$dimnames$Terms,
                           dictionary = words,  # or corpus
                           type = "shortest")

    # Result is composed entirely of NAs, with the values stored as names!
    str(unstemmed_terms)
    tail(unstemmed_terms, 20)

I am looking for a way to get the results returned by stemCompletion() as the values of a character vector, not as its names attribute. Any insight into this problem is greatly appreciated.

1 answer

The problem is that your dictionary argument to tm::stemCompletion() is not a character vector of words (or a tm Corpus object), but rather a set of lines from the Austen novel.

    tail(words)
    # [1] "most liberal-minded sister and aunt in the world."
    # [2] ""
    # [3] "When the subject was brought forward again, her views were more fully"
    # [4] "explained; and, in reply to Lady Bertram calm inquiry of \"Where shall"
    # [5] "the child come to first, sister, to you or to us?\" Sir Thomas heard with"
    # [6] "some surprise that it would be totally out of Mrs. Norris power to"
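To see why a dictionary of whole lines produces mostly NAs, here is a minimal base-R sketch. It assumes that completion works by matching each stem against the beginning of each dictionary entry. The function complete_shortest() is a hypothetical stand-in written for illustration, not tm's actual implementation.

```r
# Hypothetical stand-in for prefix-based completion (not tm's actual code):
# for each stem, return the shortest dictionary entry that starts with it,
# or NA when no entry does.
complete_shortest <- function(stems, dictionary) {
  vapply(stems, function(stem) {
    hits <- dictionary[startsWith(dictionary, stem)]
    if (length(hits) == 0) NA_character_ else hits[which.min(nchar(hits))]
  }, character(1))
}

stems <- c("surpris", "total")

# Dictionary made of a whole line: the line does not begin with either stem,
# so nothing matches and both results are NA.
line_dict <- "some surprise that it would be totally out of Mrs. Norris power to"
complete_shortest(stems, line_dict)

# Dictionary made of single words: each stem now finds a completion.
word_dict <- strsplit(line_dict, " ", fixed = TRUE)[[1]]
complete_shortest(stems, word_dict)
```

In both calls the result is a named vector: the stems end up in the names and the completions (or NA) in the values, which is the shape described in the question.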

But these lines can easily be tokenized using quanteda's tokens() and then converted to a character vector.

    unstemmed_terms <-
        tm::stemCompletion(dtm$dimnames$Terms,
                           dictionary = as.character(tokens(words, remove_punct = TRUE)),
                           type = "shortest")
    tail(unstemmed_terms, 20)
    #           arrang          chariti           perhap         parsonag          convers            happi
    #      "arranging"               NA        "perhaps"               NA   "conversation"        "happily"
    #           belief             most     liberal-mind             aunt            again             view
    #         "belief"           "most" "liberal-minded"           "aunt"          "again"          "views"
    #          explain             calm          inquiri            where             come            heard
    #      "explained"           "calm"               NA               NA           "come"          "heard"
    #          surpris            total
    #       "surprise"        "totally"
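The result is still a named vector, with the stems as names and the completions as values. If a plain character vector is wanted, one possible follow-up is sketched below in base R. The `completed` vector is a hand-typed sample copied from part of the output above, and falling back to the stem where no completion was found is an assumption about what is useful, not something stemCompletion() does for you.

```r
# Hand-typed sample mirroring part of the output above (an illustration,
# not a live call to tm): names are stems, values are completions.
completed <- c(arrang = "arranging", chariti = NA, perhap = "perhaps",
               parsonag = NA, happi = "happily")

# Where no completion was found, fall back to the stem stored in the name.
restored <- ifelse(is.na(completed), names(completed), completed)

# Drop the names attribute to leave a plain character vector.
restored <- unname(restored)
restored
```

A vector built this way has the same length and order as the original terms, so it could be assigned back to dtm$dimnames$Terms before visualizing.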

Source: https://habr.com/ru/post/1276258/

