DocumentTermMatrix fails with a strange error only when the number of terms exceeds 3000

My code below works fine unless I create a DocumentTermMatrix with more than 3000 terms. This line:

    movie_dict <- findFreqTerms(movie_dtm_train, 8)
    movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))

fails with:

    Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
      'i, j, v' different lengths
    In addition: Warning messages:
    1: In mclapply(unname(content(x)), termFreq, control) :
      all scheduled cores encountered errors in user code
    2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
      NAs introduced by coercion

Is there any way to handle this? Is a 3000 x 60000 matrix too large for DocumentTermMatrix? That seems fairly small for document classification, though.
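For a rough sense of scale, the size question can be checked with a back-of-the-envelope estimate. This is only a sketch: the figure of 5 non-zero terms per document is a made-up assumption for illustration, and the byte counts assume the usual 4-byte integers and 8-byte doubles for a sparse triplet layout (`i`, `j`, `v`).

```r
# Back-of-the-envelope size of a 60000 x 3000 document-term matrix.
n_docs  <- 60000
n_terms <- 3000
n_cells <- n_docs * n_terms                 # 1.8e8 cells

# Dense storage: one 8-byte double per cell.
dense_gb <- n_cells * 8 / 1e9               # 1.44 GB

# Sparse triplet storage keeps only the non-zero cells:
# row index i and column index j (4 bytes each) plus value v (8 bytes).
# ASSUMPTION: 5 non-zero terms per document is a hypothetical figure,
# not measured on the dataset from the question.
avg_nonzero_per_doc <- 5
sparse_mb <- n_docs * avg_nonzero_per_doc * (4 + 4 + 8) / 1e6   # 4.8 MB

cat(sprintf("dense: %.2f GB, sparse (assumed density): %.1f MB\n",
            dense_gb, sparse_mb))
```

Under these assumptions the sparse representation is only a few megabytes, so raw matrix size alone would not obviously explain the failure.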

Full code snippet:

    library(tm)

    n1 <- 60000
    n2 <- 70000

    #******* loading the data ******************************************
    # kaggle sentiment_analysis dataset
    movie_all <- read.delim('train.tsv', stringsAsFactors=FALSE)
    movie_raw <- movie_all[1:(n2),]

    #******* cleaning the corpus ***************************************
    movie_corpus <- Corpus(VectorSource(movie_raw$Phrase))
    movie_corpus_clean <- tm_map(movie_corpus, content_transformer(tolower))
    movie_corpus_clean <- tm_map(movie_corpus_clean, removeNumbers)
    movie_corpus_clean <- tm_map(movie_corpus_clean, removeWords, stopwords())
    movie_corpus_clean <- tm_map(movie_corpus_clean, removePunctuation)
    movie_corpus_clean <- tm_map(movie_corpus_clean, stripWhitespace)
    movie_dtm <- DocumentTermMatrix(movie_corpus_clean)

    #*********** break out data into train/test sets *******************
    movie_train <- movie_raw[1:(n1),]
    movie_corpus_train <- movie_corpus_clean[1:(n1)]
    movie_dtm_train <- movie_dtm[1:(n1),]

    #*********** remove rare words from document term matrix ***********
    movie_dict <- findFreqTerms(movie_dtm_train, 8)
    movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))

Edit: this fails:

    movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train[1:60000], list(dictionary = movie_dict))

but this works:

    d1 <- DocumentTermMatrix(movie_corpus_train[1:30000], list(dictionary = movie_dict))
    # start at 30001 so that document 30000 is not included twice
    d2 <- DocumentTermMatrix(movie_corpus_train[30001:60000], list(dictionary = movie_dict))
    movie_dtm_hiFq_train <- c(d1, d2)

which makes me think this is a size issue.
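The two-part workaround can be generalized to any number of chunks. The following is a minimal sketch, not a confirmed fix: `chunk_indices()` is a hypothetical helper (not part of tm), the chunk size of 30000 is simply the value that happened to work above, and the tm part is guarded so it only runs when tm is installed and `movie_corpus_train` and `movie_dict` from the snippet above exist.

```r
# Split 1..n into consecutive, non-overlapping index chunks.
# chunk_indices() is a hypothetical helper, not part of tm.
chunk_indices <- function(n, chunk_size) {
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx / chunk_size)))
}

chunks <- chunk_indices(60000, 30000)   # two chunks: 1..30000 and 30001..60000

# Build one matrix per chunk and combine them with c(), as in the workaround.
# Guarded: runs only where tm and the objects from the question are available.
if (requireNamespace("tm", quietly = TRUE) &&
    exists("movie_corpus_train") && exists("movie_dict")) {
  parts <- lapply(chunks, function(idx) {
    tm::DocumentTermMatrix(movie_corpus_train[idx],
                           control = list(dictionary = movie_dict))
  })
  movie_dtm_hiFq_train <- do.call(c, parts)
}
```

Because every chunk is built against the same dictionary, the pieces share the same columns, which is what lets them be combined row-wise.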


Source: https://habr.com/ru/post/971192/

