Filter rows / documents from a Document-Term-Matrix in R

Question


Using the tm package in R, I create a document-term matrix:

dtm <- DocumentTermMatrix(cor, control = list(dictionary=c("someTerm")))

The result is something like this:

 A document-term matrix (291 documents, 1 terms)

 Non-/sparse entries: 48/243
 Sparsity           : 84%
 Maximal term length: 8
 Weighting          : term frequency (tf)

       Terms
 Docs   someTerm
   doc1        0
   doc2        0
   doc3        7
   doc4       22
   doc5        0

Now I would like to filter this Document-Term-Matrix according to the number of someTerm instances in the documents. For instance, keep only those documents where someTerm appears at least once. That is doc3 and doc4.

How can I achieve this?

+6

matrix r text-mining tm

user3316599 Jun 14 '14 at 21:07


2 answers

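Alternatively, you can use the removeSparseTerms function, which drops sparse terms, that is, terms that are absent from most documents (see the tm documentation for removeSparseTerms). Note that it filters terms (columns) rather than documents (rows).

 dtm <- removeSparseTerms(dtm, 0.1) # Keeps only terms whose sparsity is below 10%, i.e. terms present in more than 90% of documents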
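To make the effect of removeSparseTerms on rows and columns concrete, here is a minimal sketch on made-up data. The toy corpus and the names `docs`, `corp`, `dtm_toy` and `dtm_small` are assumptions for illustration only and are not part of the question.

```r
library(tm)

# Hypothetical toy corpus: "common" occurs in every document, "rare" in one
docs <- c("common rare", "common", "common", "common")
corp <- VCorpus(VectorSource(docs))
dtm_toy <- DocumentTermMatrix(corp)

# Keep only terms that occur in more than half of the documents
dtm_small <- removeSparseTerms(dtm_toy, 0.5)

Terms(dtm_small)  # the sparse term "rare" has been dropped
nDocs(dtm_small)  # the number of documents is unchanged
```

In this sketch the sparse term is removed while all four documents remain, so on its own this function does not select documents by term count.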

+1

ElenaZhebel Mar 20 '17 at 12:37


James king · Accepted Answer · 2014-06-14T21:33:54+0000

This is very similar to how you would subset a regular R matrix. For example, to create a document-term matrix from the example Reuters dataset with only the rows where the term "would" appears more than once:

 library(tm)

 # Load the example Reuters corpus that ships with tm
 reut21578 <- system.file("texts", "crude", package = "tm")
 reuters <- VCorpus(DirSource(reut21578),
                    readerControl = list(reader = readReut21578XMLasPlain))
 dtm <- DocumentTermMatrix(reuters)

 # Logical vector: TRUE for documents where "would" appears more than once
 v <- as.vector(dtm[, "would"] > 1)
 dtm2 <- dtm[v, ]

 > inspect(dtm2[, "would"])
 A document-term matrix (3 documents, 1 terms)

 Non-/sparse entries: 3/0
 Sparsity           : 0%
 Maximal term length: 5
 Weighting          : term frequency (tf)

      Terms
 Docs  would
   246     2
   489     2
   502     2

A tm document-term matrix is a simple triplet matrix from the slam package, so the slam documentation helps in working out how to manipulate DTMs.
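Applied to the case in the question (keep documents where the term appears at least once), the same row-subsetting idea could look like the sketch below. It uses slam::row_sums on the single term column. The toy corpus, the lowercase term "someterm" and the names `docs`, `corp`, `dtm_q`, `keep` and `dtm_filtered` are assumptions made up for illustration; the question's own `cor` object is not reproduced here.

```r
library(tm)
library(slam)

# Hypothetical stand-in for the asker's corpus
docs <- c("nothing relevant here",
          "still nothing",
          "someterm appears someterm twice",
          "one someterm")
corp <- VCorpus(VectorSource(docs))

# One-term matrix, restricted to the term via the dictionary option
dtm_q <- DocumentTermMatrix(corp, control = list(dictionary = c("someterm")))

# Per-document count of the term, turned into a plain logical row index
keep <- as.vector(slam::row_sums(dtm_q[, "someterm"]) >= 1)

# Keep only the documents where the term occurs at least once
dtm_filtered <- dtm_q[keep, ]
nDocs(dtm_filtered)
```

For a small matrix, as.matrix(dtm_q)[, "someterm"] >= 1 should give the same row index; row_sums just avoids converting the whole sparse matrix to a dense one.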
