You could define a removeCommonTerms function like this:
removeCommonTerms <- function(x, pct) {
  stopifnot(
    inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
    is.numeric(pct), pct > 0, pct < 1
  )
  # Work on a term-by-document layout so that m$i indexes terms
  m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x
  # TRUE for terms that occur in fewer than pct * (number of documents) documents
  keep <- table(m$i) < m$ncol * pct
  termIndex <- as.numeric(names(keep[keep]))
  if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ]
}
Then, if you want to remove terms that appear in 80% or more of the documents, you could do:
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm
# <<DocumentTermMatrix (documents: 20, terms: 1266)>>
# Non-/sparse entries: 2255/23065
# Sparsity           : 91%
# Maximal term length: 17
# Weighting          : term frequency (tf)

removeCommonTerms(dtm, .8)
# <<DocumentTermMatrix (documents: 20, terms: 1259)>>
# Non-/sparse entries: 2129/23051
# Sparsity           : 92%
# Maximal term length: 17
# Weighting          : term frequency (tf)
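To see what the filter is doing, here is a minimal base-R sketch of the same idea on a small dense matrix. The matrix `counts` and its term names are made up for illustration and are not part of the `crude` data; the sketch only mirrors the rule "keep terms that occur in fewer than pct of the documents", not the sparse-matrix internals of tm.

```r
# Hypothetical toy document-term matrix: 5 documents (rows) x 4 terms (columns).
counts <- matrix(
  c(1, 0, 2, 1,
    3, 0, 1, 0,
    1, 1, 0, 0,
    2, 0, 4, 0,
    1, 0, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("the", "oil", "price", "opec"))
)

pct <- 0.8

# Document frequency: number of documents in which each term occurs at least once.
docFreq <- colSums(counts > 0)

# Keep only terms that occur in fewer than pct * (number of documents) documents.
keep <- docFreq < nrow(counts) * pct
filtered <- counts[, keep, drop = FALSE]

print(docFreq)
print(colnames(filtered))
```

In this toy data "the" occurs in all 5 documents, so it is dropped, while "oil", "price" and "opec" fall below the 80% cutoff and are kept.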