R: find the most common group of words in the corpus

Is there an easy way to find not only the most common single terms, but also the most common multi-word expressions (phrases, groups of words) in a text corpus in R?

Using the tm package, I can find the most commonly used terms, for example:

tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq=3, highfreq=Inf)

I can find words related to the most common terms with the findAssocs() function, so I could group these words manually. But how can I count the number of occurrences of such groups of words in the corpus?
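For example, something like this, where "oil" is just a stand-in term and 0.8 an arbitrary correlation threshold:

# words correlated with "oil" across documents (illustrative values)
findAssocs(tdm, terms = "oil", corlimit = 0.8)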

Thanks!

1 answer

You can pass TermDocumentMatrix a custom tokenizer that builds bigrams (groups of 2 words) using Weka's tokenizer:

library("tm") #text mining
library("RWeka") # for tokenization algorithms more complicated than single-word


BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# process tdm 
# findFreqTerms(tdm, lowfreq=3, highfreq=Inf)
# ...

tdm <- removeSparseTerms(tdm, 0.99)
print("----")
print("tdm properties")
str(tdm)
topN_percentage_wanted <- 20  # example value: look at the top 20% of terms
tdm_top_N_percent <- tdm$nrow / 100 * topN_percentage_wanted  # number of terms in the top N percent
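For a quick self-contained check, the same steps can be run on the crude corpus that ships with tm, used here only as a stand-in for your own corpus:

library("tm")
library("RWeka")

data("crude", package = "tm")  # 20 sample Reuters articles

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

# bigrams appearing at least 3 times across the corpus
findFreqTerms(tdm, lowfreq = 3)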

You can also generalize the tokenizer to count word groups of other sizes, for example:

# word combinations of at least 1 and at most 5 words
wmin <- 1
wmax <- 5

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = wmin, max = wmax))

Running the same frequency analysis on the resulting matrix then gives you counts for word groups of every length in that range.
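To directly answer the question of how often each group occurs, one sketch is to sum each term's counts across all documents and sort; converting to a dense matrix is fine for small corpora, while slam::row_sums avoids the dense conversion for large ones:

# total occurrences of each word group across the whole corpus
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)  # the ten most frequent word groups

# for large matrices, avoid densifying:
# freq <- sort(slam::row_sums(tdm), decreasing = TRUE)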


Source: https://habr.com/ru/post/1540526/

