R: removeCommonTerms with the quanteda package?

A removeCommonTerms function exists for the tm package, defined as follows:

    removeCommonTerms <- function (x, pct) {
        stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
                  is.numeric(pct), pct > 0, pct < 1)
        m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x
        t <- table(m$i) < m$ncol * (pct)
        termIndex <- as.numeric(names(t[t]))
        if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ]
    }

Now I would like to remove terms that are too common using the quanteda package. I could do this either before creating the document-feature matrix or on the document-feature matrix itself.

How can I remove overly common terms with the quanteda package in R?

1 answer

You need the dfm_trim function. From ?dfm_trim:

max_docfreq: the maximum number or fraction of documents in which a feature appears, above which features will be removed. (By default there is no upper limit.)
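To make the effect of such a document-frequency cap concrete, here is a minimal base-R sketch of the same idea on a toy count matrix. The data and the helper name drop_common_terms are made up for illustration; this is not how quanteda implements dfm_trim internally.

```r
# Toy document-term count matrix: 4 documents (rows), 4 terms (columns).
# The data and the helper name are hypothetical, for illustration only.
m <- matrix(
  c(3, 1, 0, 0,
    2, 0, 1, 0,
    5, 0, 0, 2,
    1, 1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("the", "cat", "dog", "bird"))
)

# Keep only terms that occur in at most max_prop of the documents.
drop_common_terms <- function(counts, max_prop) {
  stopifnot(is.matrix(counts), is.numeric(max_prop), max_prop > 0, max_prop <= 1)
  doc_freq <- colSums(counts > 0)  # number of documents containing each term
  counts[, doc_freq <= max_prop * nrow(counts), drop = FALSE]
}

trimmed <- drop_common_terms(m, max_prop = 0.8)
colnames(trimmed)
## [1] "cat"  "dog"  "bird"
```

Here "the" occurs in all 4 documents, which is more than 0.8 * 4 = 3.2, so it is dropped. The other terms stay.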

dfm_trim requires the latest version of quanteda (newly released on CRAN at the time of writing).

    packageVersion("quanteda")
    ## [1] '0.9.9.3'
    inaugdfm <- dfm(data_corpus_inaugural)
    dfm_trim(inaugdfm, max_docfreq = .8)
    ## Removing features occurring:
    ## - in more than 0.8 * 57 = 45.6 documents: 93
    ## Total features removed: 93 (1.01%).
    ## Document-feature matrix of: 57 documents, 9,081 features (92.4% sparse).
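On more recent quanteda releases the call above may need adjusting. As far as I know, later versions treat max_docfreq as a raw document count unless docfreq_type = "prop" is given, and expect dfm() to be built from a tokens object rather than directly from a corpus. A sketch under those assumptions (check ?dfm_trim for your installed version):

```r
library(quanteda)

# Assumes a recent quanteda release in which dfm() takes a tokens object
# and dfm_trim() has a docfreq_type argument.
toks <- tokens(data_corpus_inaugural)
inaugdfm <- dfm(toks)

# Drop features that appear in more than 80% of the documents.
trimmed <- dfm_trim(inaugdfm, max_docfreq = 0.8, docfreq_type = "prop")

# Compare the number of features before and after trimming.
nfeat(inaugdfm)
nfeat(trimmed)
```

The document and feature counts will differ from the output shown for 0.9.9.3, since the bundled corpus and the tokenizer have changed between versions.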

Source: https://habr.com/ru/post/1202841/

