Use tm Corpus function with big data in R

I am trying to do text processing of big data in R using tm.

I often run into memory problems (e.g., cannot allocate vector of size ...) and use the established methods to fix these problems, such as:

  • using 64-bit R
  • trying different OSes (Windows, Linux, Solaris, etc.)
  • setting memory.limit() to its maximum
  • making sure that enough RAM is available on the server for the computation (which there is)
  • making liberal use of gc()
  • profiling the code for bottlenecks
  • splitting large operations into several smaller operations

However, when I try to run Corpus on a vector of about a million text fields, I get a slightly different memory error than usual, and I'm not sure how to work around it. The error:

 > ds <- Corpus(DataframeSource(dfs))
 Error: memory exhausted (limit reached?)

Can (and should) I run Corpus incrementally on blocks of rows from that source data frame, and then combine the results? Is there a more efficient way to do this?

The amount of data that triggers this error depends on the computer running it, but if you take the built-in crude dataset and replicate its documents until the data is large enough, you can reproduce the error.
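A rough sketch of that reproduction, assuming a tm version whose DataframeSource expects `doc_id` and `text` columns (older versions accepted any data frame). The name `n_copies` and its value are placeholders, not from the original post; raise it until memory runs out on your machine. VCorpus is called explicitly because the output further down shows a VCorpus.

```r
library(tm)

data("crude")  # 20 Reuters documents shipped with tm

# Pull the raw text out of each document as a single string.
texts <- vapply(
  content(crude),
  function(d) paste(as.character(d), collapse = "\n"),
  character(1)
)

# Hypothetical replication factor; increase it to provoke the memory error.
n_copies <- 50L

big_text <- rep(unname(texts), times = n_copies)
dfs <- data.frame(
  doc_id = as.character(seq_along(big_text)),
  text = big_text,
  stringsAsFactors = FALSE
)

ds <- VCorpus(DataframeSource(dfs))
length(ds)
```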

UPDATE

I experimented with combining smaller corpora, i.e.

 test1 <- dfs[1:10000,]
 test2 <- dfs[10001:20000,]
 ds.1 <- Corpus(DataframeSource(test1))
 ds.2 <- Corpus(DataframeSource(test2))

and although I have not succeeded yet, I discovered tm_combine, which is supposed to solve this exact problem. The only catch is that, for some reason, my 64-bit build of R 3.1.1 with the latest version of tm cannot find the tm_combine function. Maybe it was removed from the package? I'm investigating ...

 > require(tm)
 > ds.12 <- tm_combine(ds.1,ds.2)
 Error: could not find function "tm_combine"
1 answer

I don't know whether tm_combine is obsolete or why it was not found in the tm namespace, but I found a solution: run Corpus on smaller chunks of the data frame, then combine the results with c().

fooobar.com/questions/892269 / ...

 test1 <- dfs[1:100000,]
 test2 <- dfs[100001:200000,]
 ds.1 <- Corpus(DataframeSource(test1))
 ds.2 <- Corpus(DataframeSource(test2))
 #ds.12 <- tm_combine(ds.1,ds.2)
 ##Error: could not find function "tm_combine"
 ds.12 <- c(ds.1,ds.2)

which gives you:

ds.12

 <<VCorpus (documents: 200000, metadata (corpus/indexed): 0/0)>> 

Sorry for not figuring this out on my own before asking. I had tried other ways of combining the objects and failed.
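The same idea can be extended from two hand-made slices to any number of blocks. Below is a minimal sketch, not a tested recipe for the million-row case: `corpus_in_chunks` and `chunk_size` are hypothetical names, not part of tm. It assumes a tm version whose DataframeSource expects `doc_id` and `text` columns, and it calls VCorpus explicitly so that every piece has the same class for c(). Whether chunking actually avoids the memory error will depend on the machine and the data.

```r
library(tm)

# Hypothetical helper (not part of tm): build a VCorpus from a data frame
# in blocks of `chunk_size` rows, then concatenate the pieces with c().
corpus_in_chunks <- function(df, chunk_size = 100000L) {
  # Assign each row to a block: 1, 1, ..., 2, 2, ...
  block <- ceiling(seq_len(nrow(df)) / chunk_size)
  pieces <- lapply(split(df, block), function(part) {
    VCorpus(DataframeSource(part))
  })
  # Drop split()'s block labels so they are not passed to c() as argument names.
  do.call(c, unname(pieces))
}

# Small stand-in for the real `dfs`.
dfs <- data.frame(
  doc_id = as.character(seq_len(25)),
  text = paste("document number", seq_len(25)),
  stringsAsFactors = FALSE
)

ds <- corpus_in_chunks(dfs, chunk_size = 10L)
length(ds)
```

With the real data, the call would be something like `corpus_in_chunks(dfs, chunk_size = 100000L)`, which matches the block size used in the two-slice example above.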


Source: https://habr.com/ru/post/974427/
