Filter rows/documents from a Document-Term-Matrix in R

Using the tm package in R, I create a Document-Term-Matrix:

dtm <- DocumentTermMatrix(cor, control = list(dictionary=c("someTerm"))) 

The result is something like this:

    A document-term matrix (291 documents, 1 terms)

    Non-/sparse entries: 48/243
    Sparsity           : 84%
    Maximal term length: 8
    Weighting          : term frequency (tf)

          Terms
    Docs   someTerm
      doc1        0
      doc2        0
      doc3        7
      doc4       22
      doc5        0

Now I would like to filter this Document-Term-Matrix according to the number of someTerm instances in the documents. For instance, keep only those documents where someTerm appears at least once. That is doc3 and doc4.

How can I achieve this?

2 answers

This is very similar to how you subset a regular R matrix. For example, to create a document-term matrix from the example Reuters dataset with only the rows where the term "would" appears more than once:

    library(tm)

    # Example Reuters corpus shipped with the tm package
    reut21578 <- system.file("texts", "crude", package = "tm")
    reuters <- VCorpus(DirSource(reut21578),
                       readerControl = list(reader = readReut21578XMLasPlain))

    dtm <- DocumentTermMatrix(reuters)

    # Logical vector: TRUE for documents where "would" appears more than once
    v <- as.vector(dtm[, "would"] > 1)

    # Keep only those rows (documents)
    dtm2 <- dtm[v, ]

    > inspect(dtm2[, "would"])
    A document-term matrix (3 documents, 1 terms)

    Non-/sparse entries: 3/0
    Sparsity           : 0%
    Maximal term length: 5
    Weighting          : term frequency (tf)

         Terms
    Docs  would
      246     2
      489     2
      502     2
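To make the comparison with a regular R matrix concrete, here is a minimal base R sketch of the same logical-index idiom. The matrix `m` and its counts are made up to mirror the output in the question; they are not taken from a real corpus.

```r
# Made-up counts mirroring the question's output (not real data)
m <- matrix(c(0, 0, 7, 22, 0), ncol = 1,
            dimnames = list(Docs = paste0("doc", 1:5), Terms = "someTerm"))

# Logical vector: TRUE for documents where the term appears at least once
keep <- m[, "someTerm"] >= 1

# drop = FALSE keeps the result a one-column matrix instead of a bare vector
m_filtered <- m[keep, , drop = FALSE]
print(m_filtered)
```

Applied to the question's matrix, the same pattern as in the answer above would presumably read `dtm[as.vector(dtm[, "someTerm"] >= 1), ]`.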

A tm document-term matrix is a simple triplet matrix from the slam package, so the slam documentation helps in working out how to manipulate DTMs.
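As a rough illustration of working at the slam level, the sketch below builds a small simple triplet matrix by hand and filters its rows with `row_sums()`. It assumes the slam package is installed; the object `counts` and its values are hypothetical stand-ins for a real DTM.

```r
library(slam)

# Hypothetical counts mirroring the question: 5 documents, 1 term.
# Only the non-zero cells (doc3 = 7, doc4 = 22) are stored.
counts <- simple_triplet_matrix(
  i = c(3L, 4L),   # row indices of the non-zero cells
  j = c(1L, 1L),   # column indices of the non-zero cells
  v = c(7, 22),    # the non-zero values
  nrow = 5L, ncol = 1L,
  dimnames = list(Docs = paste0("doc", 1:5), Terms = "someTerm")
)

# row_sums() operates on the sparse representation directly
keep <- unname(row_sums(counts) >= 1)

# Row subsetting with a logical vector, as with an ordinary matrix
filtered <- counts[keep, ]
print(rownames(filtered))
```

With several terms, `row_sums()` would keep every document containing at least one of them, which is one way to drop documents that are empty after applying a dictionary.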


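Alternatively, you can use the removeSparseTerms function, which drops terms whose sparsity is at or above a given threshold (see its documentation in the tm package). Note that it filters terms (columns) rather than documents (rows).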

    dtm <- removeSparseTerms(dtm, 0.1) # Keeps only terms that are absent from fewer than 10% of documents

Source: https://habr.com/ru/post/970847/

