Creating a Sparse Matrix from TermDocumentMatrix

Question

I created a TermDocumentMatrix with the tm library in R. It looks something like this:

> inspect(freq.terms)

A document-term matrix (19 documents, 214 terms)

Non-/sparse entries: 256/3810
Sparsity           : 94%
Maximal term length: 19 
Weighting          : term frequency (tf)

Terms
Docs abundant acid active adhesion aeropyrum alternative
  1         0    0      1        0         0           0
  2         0    0      0        0         0           0
  3         0    0      0        1         0           0
  4         0    0      0        0         0           0
  5         0    0      0        0         0           0
  6         0    1      0        0         0           0
  7         0    0      0        0         0           0
  8         0    0      0        0         0           0
  9         0    0      0        0         0           0
  10        0    0      0        0         1           0
  11        0    0      1        0         0           0
  12        0    0      0        0         0           0
  13        0    0      0        0         0           0
  14        0    0      0        0         0           0
  15        1    0      0        0         0           0
  16        0    0      0        0         0           0
  17        0    0      0        0         0           0
  18        0    0      0        0         0           0
  19        0    0      0        0         0           1

This is just a small sample of the matrix; there are actually 214 terms in it. At this small scale, everything works fine. If I want to convert my TermDocumentMatrix to a regular matrix, I would do:

data.matrix <- as.matrix(freq.terms)

However, the data shown above is just a subset of my overall data, which probably has at least 10,000 terms. When I try to create a TDM from the full data, I get an error:

> Error cannot allocate vector of size n Kb

So I am looking for more memory-efficient ways to store my tdm.

I tried turning my tdm into a sparse matrix using the Matrix library, but ran into the same problem.

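Are there alternative approaches? So far I have looked at:

bigmemory/ff (although bigmemory does not seem to be available on Windows)
irlba, to compute a partial SVD of my tdm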

EDIT: Changed 10,00 to 10,000. Thanks, @nograpes.

r sparse-matrix tm term-document-matrix

user1988898 10 . '14 21:13

Tyler Rinker · Answer 1 · 2014-02-26T20:11:02+0000

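The qdap package seems able to handle a problem of this size. The first part below recreates a data set similar to the OP's, and the solution follows. As of qdap version 1.1.0 there is compatibility with tm: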

library(qdapDictionaries)

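# Build one random document: 100 to 9,100 words sampled from the qdap dictionary
FUN <- function() {
   paste(sample(DICTIONARY[, 1], sample(seq(100, 10000, by=1000), 1, TRUE)), collapse=" ")
}

library(qdap)
# Assemble a 15-document tm Corpus from the random documents
mycorpus <- tm::Corpus(tm::VectorSource(lapply(paste0("doc", 1:15), function(i) FUN())))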

...

Now the qdap approach. First convert the Corpus to a dataframe (tm_corpus2df), then use the tdm function to create the TermDocumentMatrix.

out <- with(tm_corpus2df(mycorpus), tdm(text, docs))
tm::inspect(out)

## A term-document matrix (19914 terms, 15 documents)
## 
## Non-/sparse entries: 80235/218475
## Sparsity           : 73%
## Maximal term length: 19 
## Weighting          : term frequency (tf)
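
If the end goal is a sparse matrix from the Matrix package, as the question's title asks, one possible route is to build it directly from the triplet fields instead of going through as.matrix, which creates the dense matrix first. The sketch below is not part of the qdap approach above. It uses a small hand-made list, tdm_like, as a hypothetical stand-in for a real TermDocumentMatrix, and the helper name tdm_to_sparse is likewise made up for illustration. It relies on tm storing its matrices as slam simple triplet matrices with the fields i, j, v, nrow, ncol and dimnames.

```r
library(Matrix)

# Hypothetical stand-in for a TermDocumentMatrix. tm objects carry the same
# triplet fields: i (row index), j (column index), v (value), plus dimensions.
tdm_like <- list(
  i = c(1L, 2L, 2L, 3L),   # term (row) index of each non-zero count
  j = c(1L, 1L, 3L, 2L),   # document (column) index of each non-zero count
  v = c(2, 1, 4, 1),       # the non-zero counts themselves
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Terms = c("acid", "active", "adhesion"),
                  Docs  = c("1", "2", "3"))
)

# Build the sparse matrix straight from the triplets, so that no dense
# intermediate matrix is ever allocated.
tdm_to_sparse <- function(x) {
  sparseMatrix(i = x$i, j = x$j, x = x$v,
               dims = c(x$nrow, x$ncol),
               dimnames = x$dimnames)
}

sp <- tdm_to_sparse(tdm_like)
print(sp)
```

With a real tm object the call would presumably be tdm_to_sparse(out). Only the non-zero entries are stored, so memory use grows with the number of non-sparse entries reported by inspect rather than with terms times documents.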
