Build a Corpus from a Document-Term Matrix in the R tm Package

Question

Building a document-term matrix from a corpus is straightforward with the tm package. I would like to go the other way and build a corpus from a document-term matrix.

Let M be the number of documents in my document set, and let V be the number of terms in the vocabulary of that document set. Then the document-term matrix (dtm) is an M x V matrix.

I also have a vocabulary vector of length V. It holds the words that the column indices of the document-term matrix refer to.
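For concreteness, a toy version of this layout might look like the sketch below. The words and counts are made up for illustration and are not the asker's data; it only shows how a column index of the dtm maps back to an entry of the vocabulary vector.

```r
# Hypothetical toy data: M = 2 documents, V = 3 vocabulary terms.
vocab <- c("oil", "price", "market")

# Row i is document i; column j holds the count of vocab[j] in that document.
dtm <- matrix(c(2, 0, 1,
                0, 1, 1),
              nrow = 2, byrow = TRUE)

dim(dtm)                       # M x V
terms_doc1 <- vocab[dtm[1, ] > 0]  # words that occur in document 1
terms_doc1
```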

From the dtm and the vocabulary vector, I would like to build a corpus object, because I want to stem my document set. I created my dtm and vocab manually, i.e. there has never been a tm "corpus" object representing my dataset, so I cannot simply call:

tm_map(corpus, stemDocument, language="english")

I tried a workaround in which I stem the vocabulary on its own and keep only the unique words, but it then gets a bit involved to maintain the correspondence between the dtm and the vocabulary vector.

Ideally, the end result would be that my vocabulary vector is stemmed and contains only unique entries, and the dtm indices correspond to the stemmed vocabulary vector. If you can think of any other way to do this, I would appreciate that too.
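One possible way to get that result without a corpus is to stem the vocabulary and then sum the dtm columns that share a stem. The sketch below is only an illustration in base R: `collapse_by_stem` and `toy_stem` are hypothetical names, the data is made up, and `toy_stem` is a crude stand-in for a real stemmer. Assuming the SnowballC package is installed, `SnowballC::wordStem` could be passed as `stem_fun` instead.

```r
# Collapse dtm columns whose vocabulary entries share a stem.
# `stem_fun` is any function mapping a character vector of words to stems.
collapse_by_stem <- function(dtm, vocab, stem_fun) {
  stopifnot(ncol(dtm) == length(vocab))
  stems <- stem_fun(vocab)
  # rowsum() sums rows by group, so transpose to put terms in rows,
  # sum the rows that share a stem, then transpose back to documents x stems.
  collapsed <- t(rowsum(t(dtm), group = stems))
  list(dtm = collapsed, vocab = colnames(collapsed))
}

# Crude stand-in stemmer, for illustration only (not a real stemming algorithm).
toy_stem <- function(words) sub("(ning|s)$", "", words)

# Hypothetical toy data: 2 documents, 5 vocabulary terms.
vocab <- c("run", "runs", "running", "cat", "cats")
dtm <- matrix(c(1, 0, 2, 1, 0,
                0, 3, 0, 0, 2),
              nrow = 2, byrow = TRUE)

res <- collapse_by_stem(dtm, vocab, toy_stem)
res$vocab  # unique stems, sorted alphabetically by rowsum()
res$dtm    # column j holds the summed counts for res$vocab[j]
```

Because the new vocabulary is read off the column names of the collapsed matrix, the dtm and the vocabulary vector stay aligned by construction, and each document's total count is unchanged.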

My problem would be solved if I could just build a tm corpus from my dtm and vocabulary vector, stem the corpus, and then convert back to a dtm and vocabulary vector (I already know how to do those conversions).

Let me know if I can clarify the problem.

r text-mining tm corpus lda

sinwav 25 . '14 21:27

Tyler Rinker · Accepted Answer · 2014-06-25T21:47:48+0000

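You can rebuild the text of each document from the dtm (word order is lost, but the term counts are preserved) and then build a corpus from it with tm: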

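## Minimal reproducible example
library(tm)
data("crude")
## Use the default term-frequency weighting: rep() below needs integer
## counts, and fractional tf-idf weights would be truncated
dtm <- DocumentTermMatrix(crude, control = list(stopwords = TRUE))

## Convert the dtm to a character vector with one pseudo-document per row;
## each term is repeated as many times as it occurs
dtm2list <- apply(dtm, 1, function(x) {
    paste(rep(names(x), x), collapse=" ")
})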

## convert to a Corpus
myCorp <- VCorpus(VectorSource(dtm2list))
inspect(myCorp)

## Stemming (stemDocument needs the SnowballC package to be installed)
myCorp <- tm_map(myCorp, stemDocument)
inspect(myCorp)
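The example above starts from a tm DocumentTermMatrix, whose columns are already named by term. For a dtm built by hand with a separate vocabulary vector, as in the question, the same reconstruction step might look like the following base R sketch. The data is hypothetical, and the dtm is assumed to hold raw integer counts.

```r
# Hypothetical toy data: 2 documents, 3 vocabulary terms, raw counts.
vocab <- c("oil", "price", "market")
dtm <- matrix(c(2, 0, 1,
                0, 1, 1),
              nrow = 2, byrow = TRUE)

# Rebuild one pseudo-document per row by repeating each term by its count.
# Word order cannot be recovered from a dtm, so terms appear in vocab order.
docs <- apply(dtm, 1, function(counts) {
  paste(rep(vocab, times = counts), collapse = " ")
})
docs
```

The resulting character vector can then be passed to `VCorpus(VectorSource(docs))` and stemmed with `tm_map`, as in the answer's code.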
