R text mining: adding new documents to an existing corpus

Is there an R text mining package that offers something like the following?

    myCorpus <- Corpus(DirSource(<directory-containing-textfiles>), control=...)
    # add docs
    myCorpus.addDocs(DirSource(<new-dir>), control=...)

Ideally, I would like to add further documents to the existing corpus.

Any help is appreciated

2 answers

You can simply use c(), as in:

    > library(tm)
    > data("acq")
    > data("crude")
    > together <- c(acq, crude)
    > acq
    A corpus with 50 text documents
    > crude
    A corpus with 20 text documents
    > together
    A corpus with 70 text documents

For more information, see the tm package documentation in the tm_combine section.
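To relate this to the question, here is a minimal sketch that builds two corpora separately and then merges them. The texts and variable names are made up for illustration. In a real setting each VectorSource(...) would presumably be replaced by DirSource(...) pointing at your own directories. I use VCorpus explicitly, on the assumption that the c() method described in tm_combine is the one defined for that class.

```r
library(tm)

# Hypothetical in-memory documents standing in for two directories of text files
first_batch  <- c("Oil prices rose sharply.", "The merger was approved.")
second_batch <- c("A new document arrives later.")

# Build the existing corpus and the corpus of new documents separately
my_corpus  <- VCorpus(VectorSource(first_batch))
new_corpus <- VCorpus(VectorSource(second_batch))

# Append the new documents to the existing corpus
my_corpus <- c(my_corpus, new_corpus)

# Number of documents in the combined corpus
length(my_corpus)
```

Note that the document ids are not renumbered by this sketch, so both corpora may contain a document with id "1".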


I also ran into this problem while working with large text data sets for predictive modeling, where it is not possible to load the entire data set at once.

For such large data sets another option is possible. The approach is to build a one-document corpus for each document inside a loop and collect these in a vector. After all the documents have been processed this way, you can convert the vector into one large corpus, for example to create a DTM from it.

    # Vector to collect the corpora:
    webCorpusCollection <- c()

    # Loop over raw data:
    for(i in ...) {
      try({
        # Convert one document into a corpus:
        webDocument <- Corpus(VectorSource(iconv(webDocuments[i,1], "latin1", "UTF-8")))
        #
        # Do other things eg preprocessing...
        #
        # Store this document into the corpus vector:
        webCorpusCollection <- rbind(webCorpusCollection, webDocument)
      })
    }

    # Collecting done. Create one huge corpus:
    webCorpus <- Corpus(VectorSource(unlist(webCorpusCollection[,"content"])))
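
A variant of the same idea, offered only as a sketch: collect the one-document corpora in a plain list and merge them at the end with c(), then build the DTM. The raw_texts vector is a hypothetical stand-in for the real data, and VCorpus is used on the assumption that c() is defined for that class. This is not the code above, just an alternative that avoids rbind.

```r
library(tm)

# Hypothetical raw data; in practice this would be read piece by piece
raw_texts <- c("first document", "second document", "third document")

# Pre-allocate a list to hold one small corpus per document
corpus_list <- vector("list", length(raw_texts))

for (i in seq_along(raw_texts)) {
  # Convert one document into a one-document corpus
  doc_corpus <- VCorpus(VectorSource(raw_texts[i]))
  # Per-document preprocessing could go here
  corpus_list[[i]] <- doc_corpus
}

# Merge all one-document corpora into a single corpus
web_corpus <- do.call(c, corpus_list)

# Build a document-term matrix from the merged corpus
dtm <- DocumentTermMatrix(web_corpus)
```

Growing a list and merging once at the end keeps the loop body simple. Whether it is faster than the rbind approach on very large data is something I have not measured.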

Source: https://habr.com/ru/post/892269/

