My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the quanteda package in R. I'd like to reduce words to their stems before the topic modeling step, so that I'm not counting variations of the same word as different topics.
The only problem is that the stemming algorithm leaves behind some words that aren't really words. "Happiness" stems to "happi", "arranges" stems to "arrang", and so on. So, before I visualize the results of the topic modeling, I'd like to restore the stems to complete words.
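To illustrate the kind of truncated stems I mean, here is a minimal sketch using quanteda's char_wordstem() on a few hand-picked words. The word list is made up for illustration and is not from our data.

```r
library(quanteda)

# Hypothetical example words, chosen only to show the effect of stemming
words <- c("happiness", "happy", "arranges")

# char_wordstem() applies the Snowball stemmer to a character vector
stems <- char_wordstem(words, language = "english")

# Show each original word next to its stem
print(data.frame(word = words, stem = stems))
```

Variants of the same word collapse to one stem, which is what I want for the modeling step, but the stems are not readable words in a plot.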
Looking through some previous threads here on Stack Overflow, I came across a function, stemCompletion(), from the tm package, which does this, at least approximately. It seems to work fairly well.
But when I apply it to the terms vector of a document-term matrix, stemCompletion() always replaces the names of the character vector, not the character values themselves. Here's a reproducible example:
    # Set up libraries
    library(janeaustenr)
    library(quanteda)
    library(tm)
I'm looking for a way to get the results returned by stemCompletion() into the character vector itself, rather than into the character vector's names attribute. Any insight into this problem is greatly appreciated.
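For what it's worth, here is how the names and values look on a toy input. This is only a sketch: the stems and the dictionary are made up by hand rather than taken from the Austen text, and I chose type = "shortest" arbitrarily. If I'm reading the output correctly, the completions are the values and the input stems are kept as names, so unname() would give a plain character vector. I'm not sure this carries over to the real terms vector.

```r
library(tm)

# Hypothetical stems and dictionary, made up for illustration
stems <- c("happi", "arrang")
dictionary <- c("happiness", "arrange", "arranges")

# stemCompletion() returns a named character vector:
# the names are the input stems, the values are the completions
completed <- stemCompletion(stems, dictionary = dictionary, type = "shortest")

names(completed)   # the original stems
unname(completed)  # the completed words, without the names attribute

# Keep only the values as a plain character vector
completed_words <- unname(completed)
```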