Removing words that are too common (found in more than 80% of documents) in R

I am working with the 'tm' package to build a corpus. I have completed most of the preprocessing steps; what remains is to remove overly common words (terms that are found in more than 80% of the documents). Can someone help me with this?

    dsc <- Corpus(dd)
    dsc <- tm_map(dsc, stripWhitespace)
    dsc <- tm_map(dsc, removePunctuation)
    dsc <- tm_map(dsc, removeNumbers)
    dsc <- tm_map(dsc, removeWords, otherWords1)
    dsc <- tm_map(dsc, removeWords, otherWords2)
    dsc <- tm_map(dsc, removeWords, otherWords3)
    dsc <- tm_map(dsc, removeWords, javaKeywords)
    dsc <- tm_map(dsc, removeWords, stopwords("english"))
    dsc <- tm_map(dsc, stemDocument)
    dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf,
                                                  stopwords = FALSE))
    # Removes overly rare words (terms that occur in less than 1% of the documents)
    dtm <- removeSparseTerms(dtm, 0.99)
2 answers

What if you create a removeCommonTerms function like this:

    removeCommonTerms <- function(x, pct) {
        stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
                  is.numeric(pct), pct > 0, pct < 1)
        # Work with terms as rows (simple triplet form), so m$i indexes terms
        m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x
        # table(m$i) counts how many documents each term appears in;
        # keep only terms whose document frequency is below the pct threshold
        t <- table(m$i) < m$ncol * (pct)
        termIndex <- as.numeric(names(t[t]))
        if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ]
    }

Then, if you want to remove terms that appear in >= 80% of the documents, you could do

 data("crude") dtm <- DocumentTermMatrix(crude) dtm # <<DocumentTermMatrix (documents: 20, terms: 1266)>> # Non-/sparse entries: 2255/23065 # Sparsity : 91% # Maximal term length: 17 # Weighting : term frequency (tf) removeCommonTerms(dtm ,.8) # <<DocumentTermMatrix (documents: 20, terms: 1259)>> # Non-/sparse entries: 2129/23051 # Sparsity : 92% # Maximal term length: 17 # Weighting : term frequency (tf) 

If you are going to build a DocumentTermMatrix anyway, an alternative approach is to use the bounds option with a global component in the control list. For instance:

    ndocs <- length(dsc)
    # ignore overly rare terms (appearing in less than 1% of the documents)
    minDocFreq <- ndocs * 0.01
    # ignore overly common terms (appearing in more than 80% of the documents)
    maxDocFreq <- ndocs * 0.8
    dtm <- DocumentTermMatrix(dsc, control = list(bounds = list(
        global = c(minDocFreq, maxDocFreq))))
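As a self-contained check of the bounds approach, here is a minimal sketch using the crude corpus that ships with tm; the 1% and 80% cutoffs mirror the snippet above:

    library(tm)
    data("crude")
    ndocs <- length(crude)
    # terms outside the [1%, 80%] document-frequency window are discarded
    dtm <- DocumentTermMatrix(crude, control = list(bounds = list(
        global = c(ndocs * 0.01, ndocs * 0.8))))
    dtm  # fewer terms than the unbounded matrix, since frequent terms are dropped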


