R tm: reloading a 'PCorpus'-backed filehash database as a corpus (for example, in a restarted session or script)

Having learned a great deal from the answers on this site (thanks!), it's finally time to ask a question of my own.

I use R (the tm and lsa packages) to create, clean, and simplify a corpus of about 15,000 text documents, and then run LSA (latent semantic analysis) on it. I am doing this in R 3.0.0 under Mac OS X 10.6.

For efficiency (and because I have too little RAM), I have been trying to use either 'PCorpus' in tm (a database-backed corpus supported by the 'filehash' package) or the newer 'tm.plugin.dc' for so-called "distributed" corpus processing. But I don't really understand how either of them works under the hood.

An apparent bug when using DCorpus with tm_map (not relevant right now) led me to do some of the preprocessing with the PCorpus option instead. And it takes many hours. So I use R CMD BATCH to run a script doing things like:

> # load corpus from predefined directory path,
> # and create backend database to support processing:
> bigCcorp = PCorpus(bigCdir, readerControl = list(load=FALSE), dbControl = list(useDb = TRUE, dbName = "bigCdb", dbType = "DB1"))
> # converting to lower case:
> bigCcorp = tm_map(bigCcorp, tolower)
> # removing stopwords:
> stoppedCcorp = tm_map(bigCcorp, removeWords, stoplist)

Now, suppose my script crashes shortly after this point, or I simply forget to export the corpus in some other form, and then I restart R. The database, with all its data, is still sitting on my hard drive. Surely I can reload it into the new R session and carry on processing the corpus, instead of starting all over again?

It seems like it should be a trivial question... but no amount of dbInit() or dbLoad() or variations on the PCorpus() function seems to work. Does anyone know the right incantation?

I have scoured all the related documentation, and every paper and web forum I can find, and drawn a complete blank - nobody seems to have done it. Or have I missed something?

1 answer

The original question dates from 2013. Meanwhile, in February 2015, a duplicate, or at least a very similar, question was answered:

How to connect to PCorpus in R tm package? The answer in that post is essential, though quite minimalist, so I will try to expand on it here.

Here are a few things I found out while working on a similar problem:

Note that the dbInit() function is not part of the tm package.
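
For instance, reconnecting to the database from the question above would go through filehash directly. A minimal sketch, assuming the "bigCdb" file created by the earlier script is still in the working directory:

library(filehash)

# reconnect to the existing "DB1"-format database that PCorpus() created
db <- dbInit("bigCdb", type = "DB1")

# list the keys (one per stored document) held in the database
dbList(db)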

First you need to install the filehash package, which the tm documentation only "suggests" installing. This means filehash is not a hard dependency of tm.
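
In practice that just means installing it yourself, for example:

# filehash is only "suggested" by tm, so installing tm does not pull it in
install.packages("filehash")
library(filehash)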

Presumably you can also use the filehashSQLite package with library("filehashSQLite") instead of library("filehash"); both packages have the same interface and work together seamlessly thanks to their object-oriented design. So also install "filehashSQLite" (edit 2016: some functions, such as tm::content_transformer(), are not implemented for filehashSQLite).

Then this works:

library(filehashSQLite)

# this string becomes the filename, so it must not contain dots.
# Example: "mydata.sqlite" is not permitted.
s <- "sqldb_pcorpus_mydata"  # replace mydata with something more descriptive

suppressMessages(library(filehashSQLite))

if (!file.exists(s)) {
  # csv is a data frame of 900 documents, 18 cols/features
  pc = PCorpus(DataframeSource(csv),
               readerControl = list(language = "en"),
               dbControl = list(dbName = s, dbType = "SQLite"))
  dbCreate(s, "SQLite")
  db <- dbInit(s, "SQLite")
  set.seed(234)
  # add another record, just to show we can.
  # key = "test", value = "hi there"
  dbInsert(db, "test", "hi there")
} else {
  db <- dbInit(s, "SQLite")
  pc <- dbLoad(db)
}

show(pc)
# <<PCorpus>>
# Metadata:  corpus specific: 0, document level (indexed): 0
# Content:  documents: 900

dbFetch(db, "test")

# remove it
rm(db)
rm(pc)

# reload it
db <- dbInit(s, "SQLite")
pc <- dbLoad(db)

# the corpus entries are now accessible, but not loaded into memory.
# the 900 documents are bound via "Active Bindings", created by
# makeActiveBinding() from the base package
show(pc)
# [1]   "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"
# ...
# [883] "883" "884" "885" "886" "887" "888" "889" "890" "891" "892"
# [893] "893" "894" "895" "896" "897" "898" "899" "900"
# [901] "test"

dbFetch(db, "900")
# <<PlainTextDocument>>
# Metadata:  7
# Content:  chars: 33

dbFetch(db, "test")
# [1] "hi there"

This is what the database backend looks like. You can see that the documents from the data frame have somehow been encoded inside the SQLite table.

[screenshot: the SQLite database table]
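
If you want to peek into that table yourself, a quick sketch along these lines should work (it uses the DBI and RSQLite packages, which are an assumption on my part and not part of the answer above):

library(DBI)
library(RSQLite)

# open the SQLite file that filehashSQLite created for the corpus
con <- dbConnect(SQLite(), "sqldb_pcorpus_mydata")

# list the table(s) filehashSQLite uses internally, then look at a few rows
dbListTables(con)
tbl <- dbListTables(con)[1]
dbGetQuery(con, paste("SELECT * FROM", tbl, "LIMIT 3"))

dbDisconnect(con)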

This is what the RStudio IDE shows me: [screenshot]


Source: https://habr.com/ru/post/1483455/

