Big Text Corpus crashes tm_map

I've been racking my brain over this for the past few days. I searched the SO archives and tried the suggested solutions, but I couldn't get any of them to work. I have sets of txt documents in folders such as 2000-06, 1995-99, etc., and I want to run some basic text mining operations on them, such as creating a document-term matrix and a term-document matrix and doing some operations based on co-occurrence of words. My script works on a smaller corpus; however, when I try it with a larger corpus, it fails. I have pasted the code for one such folder below.

    library(tm)           # Framework for text mining.
    library(SnowballC)    # Provides wordStem() for stemming.
    library(RColorBrewer) # Generate palette of colours for plots.
    library(ggplot2)      # Plot word frequencies.
    library(magrittr)
    library(Rgraphviz)
    library(directlabels)

    setwd("/ConvertedText")
    txt <- file.path("2000-06")
    docs <- VCorpus(DirSource(txt, encoding = "UTF-8"), readerControl = list(language = "UTF-8"))
    docs <- tm_map(docs, content_transformer(tolower), mc.cores=1)
    docs <- tm_map(docs, removeNumbers, mc.cores=1)
    docs <- tm_map(docs, removePunctuation, mc.cores=1)
    docs <- tm_map(docs, stripWhitespace, mc.cores=1)
    docs <- tm_map(docs, removeWords, stopwords("SMART"), mc.cores=1)
    docs <- tm_map(docs, removeWords, stopwords("en"), mc.cores=1)
    # corpus creation complete

    setwd("/ConvertedText/output")
    dtm <- DocumentTermMatrix(docs)
    tdm <- TermDocumentMatrix(docs)
    m <- as.matrix(dtm)
    write.csv(m, file="dtm.csv")
    dtms <- removeSparseTerms(dtm, 0.2)
    m1 <- as.matrix(dtms)
    write.csv(m1, file="dtms.csv")
    # matrix creation/storage complete

    freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
    wf <- data.frame(word=names(freq), freq=freq)
    freq[1:50]
    # adjust freq score in next line
    p <- ggplot(subset(wf, freq>100), aes(word, freq)) +
      geom_bar(stat="identity") +
      theme(axis.text.x=element_text(angle=45, hjust=1))
    ggsave("frequency2000-06.png", height=12, width=17, dpi=72)
    # frequency graph generated

    x <- as.matrix(findFreqTerms(dtm, lowfreq=1000))
    write.csv(x, file="freqterms00-06.csv")
    png("correlation2000-06.png", width=12, height=12, units="in", res=900)
    graph.par(list(edges=list(col="lightblue", lty="solid", lwd=0.3)))
    graph.par(list(nodes=list(col="darkgreen", lty="dotted", lwd=2, fontsize=50)))
    plot(dtm, terms=findFreqTerms(dtm, lowfreq=1000)[1:50], corThreshold=0.7)
    dev.off()

When I use the mc.cores = 1 argument in tm_map, the operation runs indefinitely. If I use the argument lazy = TRUE in tm_map instead, it appears to go through, but subsequent operations give this error:

    Error in UseMethod("meta", x) :
      no applicable method for 'meta' applied to an object of class "try-error"
    In addition: Warning messages:
    1: In mclapply(x$content[i], function(d) tm_reduce(d, x$lazy$maps)) :
      all scheduled cores encountered errors in user code
    2: In mclapply(unname(content(x)), termFreq, control) :
      all scheduled cores encountered errors in user code

I have searched everywhere for a solution, but every attempt has failed. Any help would be greatly appreciated!

Best!

1 answer

I found a solution that works.

Background / Debugging steps

I tried a few things that didn't work:

  • Adding "content_transformer" to some tm_map calls, to all of them, and to just one (tolower)
  • Adding "lazy = T" to tm_map
  • Trying some parallel computing packages

The code fails for two of my scripts but works every time for a third. The code in all three scripts is the same; only the size of the .rda file that I load differs. The data structure is also identical in all three.

  • Dataset 1: Size - 493.3KB = error
  • Dataset 2: Size - 630.6KB = error
  • Dataset 3: Size - 300.2KB = works!

Just weird.

My sessionInfo() output is:

    R version 3.1.2 (2014-10-31)
    Platform: x86_64-apple-darwin13.4.0 (64-bit)

    locale:
    [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] snowfall_1.84-6 snow_0.3-13 Snowball_0.0-11 RWekajars_3.7.11-1 rJava_0.9-6 RWeka_0.4-23
    [7] slam_0.1-32 SnowballC_0.5.1 tm_0.6 NLP_0.1-5 twitteR_1.1.8 devtools_1.6

    loaded via a namespace (and not attached):
    [1] bit_1.1-12 bit64_0.9-4 grid_3.1.2 httr_0.5 parallel_3.1.2 RCurl_1.95-4.3 rjson_0.2.14 stringr_0.6.2
    [9] tools_3.1.2

Solution

I just added this line after loading the data, and now everything works:

 MyCorpus <- tm_map(MyCorpus, content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')), mc.cores=1) 
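To see what this conversion does, here is a minimal sketch on a plain character vector rather than a corpus. With sub = "byte", iconv() replaces any byte it cannot convert with its hex code in the form <xx> instead of failing, so later string functions such as tolower no longer receive invalid input. The target 'UTF-8-MAC' in the line above is specific to OS X; the sketch uses "UTF-8" instead, which is an assumption for other platforms. The names raw_text and clean_text are illustrative, and validUTF8() assumes R 3.3.0 or later.

```r
# Illustrative input: the second string contains a stray byte (0xE9)
# that is not valid UTF-8 on its own.
raw_text <- c("plain ascii", "caf\xe9 latte")

# Convert UTF-8 to UTF-8; bytes that cannot be converted are replaced
# by their hex code in the form <xx> rather than causing an error or NA.
clean_text <- iconv(raw_text, from = "UTF-8", to = "UTF-8", sub = "byte")

# After the conversion every element should be valid UTF-8.
print(clean_text)
print(validUTF8(clean_text))
```

Valid text passes through unchanged, so applying the conversion to the whole corpus should only affect the documents that contain stray bytes.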

A hint was found here: http://davetang.org/muse/2013/04/06/using-the-r_twitter-package/ (the author updated his code due to an error on November 26, 2014.)
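Since the fix is an encoding conversion, the failing datasets probably contained bytes that were not valid in the session's encoding, which would explain why file size alone did not predict the error. A rough way to check this, assuming R 3.3.0 or later (newer than the version in the sessionInfo() output above), is base R's validUTF8(). The vector texts and its file names below are made up for illustration.

```r
# Illustrative document texts, keyed by hypothetical file names.
texts <- c(
  "a.txt" = "all good here",
  "b.txt" = "broken \xff byte",
  "c.txt" = "also fine"
)

# validUTF8() returns FALSE for strings that are not valid UTF-8,
# so negating it selects the documents that need cleaning.
bad_docs <- names(texts)[!validUTF8(texts)]

print(bad_docs)
```

For a tm corpus, the text of each document would first have to be extracted (for example with content()); that step is not shown here.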


Source: https://habr.com/ru/post/977954/
