I am trying to do text processing on big data in R using tm.
I often run into memory problems (e.g., cannot allocate vector of size ...) and use the established methods to fix them, e.g.:
- using 64-bit R
- trying another OS (Windows, Linux, Solaris, etc.)
- setting memory.limit() to its maximum, and making sure enough RAM is available on the server (and calculating how much is) — a quick sketch of these checks follows this list
- liberal use of gc()
- profiling the code for bottlenecks
- splitting large operations into several smaller operations
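For reference, a minimal sketch of the checks from the list above as I typically run them; memory.limit() and memory.size() are Windows-only, and the limit value here is just illustrative:

memory.limit(size = 56000)             # raise the limit toward physical RAM (in MB, Windows only)
memory.size(max = TRUE)                # how much memory R has actually claimed so far
gc()                                   # force a garbage collection and report usage
print(object.size(dfs), units = "Mb")  # rough footprint of the input data frame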
However, when I try to run Corpus on a vector of a million text fields, I encounter a slightly different memory error than usual, and I'm not sure how to work around it. Error:
> ds <- Corpus(DataframeSource(dfs))
Error: memory exhausted (limit reached?)
Can (and should) I run Corpus in stages on blocks of rows from this data frame source, and then combine the results? Is there a more efficient way to do this?
The size of the data that triggers this error depends on the machine running it, but if you take the built-in crude dataset and replicate the documents until it is large enough, you can reproduce the error.
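For example, this is roughly how I reproduce it with crude; the replication factor is arbitrary, and the text-extraction step may need adjusting depending on your tm version:

library(tm)
data("crude")                               # 20 Reuters articles that ship with tm
txt <- unlist(lapply(crude, as.character))  # pull the raw text out of each document
dfs <- data.frame(text = rep(txt, 50000),   # 20 docs x 50,000 = 1 million rows
                  stringsAsFactors = FALSE)
ds  <- Corpus(DataframeSource(dfs))         # eventually fails with "memory exhausted"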
UPDATE
I experimented with building and combining smaller corpora, i.e.
test1 <- dfs[1:10000,]
test2 <- dfs[10001:20000,]
ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))
and while that was successful, I discovered tm_combine, which is supposed to solve exactly this problem. The only catch is that, for some reason, my 64-bit build of R 3.1.1 with the latest version of tm cannot find the tm_combine function. Perhaps it was removed from the package for some reason? I'm investigating ...
> require(tm)
> ds.12 <- tm_combine(ds.1, ds.2)
Error: could not find function "tm_combine"
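For reference, this is roughly the staged approach I have in mind, generalized from the two-chunk test above; the chunk size is arbitrary, and I'm assuming c() can concatenate corpora in place of the missing tm_combine, which I haven't verified:

chunk.size <- 10000
starts <- seq(1, nrow(dfs), by = chunk.size)
pieces <- lapply(starts, function(i) {
  block <- dfs[i:min(i + chunk.size - 1, nrow(dfs)), , drop = FALSE]
  Corpus(DataframeSource(block))   # build a corpus for each block of rows
})
ds <- do.call(c, pieces)           # combine the per-chunk corpora into one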