Removing words that are too common (found in more than 80% of documents) in R

I am working with the 'tm' package to build a corpus. I have completed most of the preprocessing steps; what remains is to remove overly common words (terms that are found in more than 80% of the documents). Can someone help me with this?

    dsc <- Corpus(dd)
    dsc <- tm_map(dsc, stripWhitespace)
    dsc <- tm_map(dsc, removePunctuation)
    dsc <- tm_map(dsc, removeNumbers)
    dsc <- tm_map(dsc, removeWords, otherWords1)
    dsc <- tm_map(dsc, removeWords, otherWords2)
    dsc <- tm_map(dsc, removeWords, otherWords3)
    dsc <- tm_map(dsc, removeWords, javaKeywords)
    dsc <- tm_map(dsc, removeWords, stopwords("english"))
    dsc <- tm_map(dsc, stemDocument)
    dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf,
                                                  stopwords = FALSE))
    # Removes overly rare words (terms that occur in less than 1% of the documents)
    dtm <- removeSparseTerms(dtm, 0.99)
2 answers

What if you create a removeCommonTerms function like this:

    removeCommonTerms <- function(x, pct) {
        stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
                  is.numeric(pct), pct > 0, pct < 1)
        # Work with terms as rows (simple triplet form), so m$i indexes terms
        m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x
        # table(m$i) counts how many documents each term appears in;
        # keep only terms whose document frequency is below the pct threshold
        t <- table(m$i) < m$ncol * (pct)
        termIndex <- as.numeric(names(t[t]))
        if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ]
    }

Then, if you want to remove terms that appear in >= 80% of the documents, you could do

 data("crude") dtm <- DocumentTermMatrix(crude) dtm # <<DocumentTermMatrix (documents: 20, terms: 1266)>> # Non-/sparse entries: 2255/23065 # Sparsity : 91% # Maximal term length: 17 # Weighting : term frequency (tf) removeCommonTerms(dtm ,.8) # <<DocumentTermMatrix (documents: 20, terms: 1259)>> # Non-/sparse entries: 2129/23051 # Sparsity : 92% # Maximal term length: 17 # Weighting : term frequency (tf) 

If you are going to build a DocumentTermMatrix anyway, an alternative approach is to use the bounds option with a global component in the control list. For instance:

    ndocs <- length(dsc)
    # ignore overly rare terms (appearing in less than 1% of the documents)
    minDocFreq <- ndocs * 0.01
    # ignore overly common terms (appearing in more than 80% of the documents)
    maxDocFreq <- ndocs * 0.8
    dtm <- DocumentTermMatrix(dsc, control = list(bounds = list(
        global = c(minDocFreq, maxDocFreq))))
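As a self-contained check of the bounds approach, here is a minimal sketch using the crude corpus that ships with tm; the 1% and 80% cutoffs mirror the snippet above:

    library(tm)
    data("crude")
    ndocs <- length(crude)
    # terms outside the [1%, 80%] document-frequency window are discarded
    dtm <- DocumentTermMatrix(crude, control = list(bounds = list(
        global = c(ndocs * 0.01, ndocs * 0.8))))
    dtm  # fewer terms than the unbounded matrix, since frequent terms are dropped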


