I also ran into this problem in the context of large text data sets used for prediction, where it is not possible to load the entire data set into memory at once.
For such large data sets another option is possible: build up a vector of single-document corpora inside a loop. After all documents have been processed this way, you can convert that vector into one huge corpus, e.g. to create a DTM on it.
library(tm)   # provides Corpus, VectorSource, etc.

# Vector to collect the single-document corpora:
webCorpusCollection <- c()

# Loop over the raw data (e.g. the rows of webDocuments):
for(i in ...) {
    try({
        # Convert one document into a corpus:
        webDocument <- Corpus(VectorSource(iconv(webDocuments[i, 1], "latin1", "UTF-8")))
        #
        # Do other things, e.g. preprocessing...
        #
        # Store this document in the collection vector:
        webCorpusCollection <- rbind(webCorpusCollection, webDocument)
    })
}

# Collecting done. Create one huge corpus:
webCorpus <- Corpus(VectorSource(unlist(webCorpusCollection[, "content"])))
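For the final DTM step mentioned above, a minimal sketch could look like the following. The control options (lowercasing, punctuation and stopword removal, minimum word length) and the sparsity threshold are illustrative choices, not part of the original code.

# Build the document-term matrix from the combined corpus;
# the control options below are example preprocessing choices.
webDTM <- DocumentTermMatrix(webCorpus,
                             control = list(tolower = TRUE,
                                            removePunctuation = TRUE,
                                            stopwords = TRUE,
                                            wordLengths = c(3, Inf)))

# Optionally drop very sparse terms to keep the matrix manageable:
webDTM <- removeSparseTerms(webDTM, 0.99)

# Show a summary and a sample of the resulting DTM:
inspect(webDTM)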