My code below works fine unless I create the DocumentTermMatrix with more than 3000 terms. The problem is in these lines:
movie_dict <- findFreqTerms(movie_dtm_train, 8)
movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))
Failure:
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
  'i, j, v' different lengths
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
  NAs introduced by coercion
Is there any way to handle this? Is a 3000 * 60000 matrix too large for DocumentTermMatrix? That seems fairly small for document classification, though.
Full code snippet:
n1 <- 60000
n2 <- 70000
Edit: This fails:
movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train[1:60000], list(dictionary = movie_dict))
but it works:
d1 <- DocumentTermMatrix(movie_corpus_train[1:30000], list(dictionary = movie_dict))
d2 <- DocumentTermMatrix(movie_corpus_train[30001:60000], list(dictionary = movie_dict))
movie_dtm_hiFq_train <- c(d1, d2)
which makes me think this is a size issue.
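The two-chunk workaround above could be generalized into a small helper. This is only a sketch under assumptions: `chunk_indices` and `build_dtm_in_chunks` are hypothetical names that are not part of tm, the default chunk size of 30000 is simply the value that happened to work here, and the corpus is assumed to support `[` subsetting the way `movie_corpus_train` does above.

```r
# Hypothetical helper: split 1..n into consecutive, non-overlapping index chunks.
chunk_indices <- function(n, size) {
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx / size)))
}

# Hypothetical helper: build one DocumentTermMatrix per chunk, then combine
# them with c(), as in the manual d1/d2 workaround.
build_dtm_in_chunks <- function(corpus, dictionary, chunk_size = 30000) {
  parts <- lapply(chunk_indices(length(corpus), chunk_size), function(i) {
    tm::DocumentTermMatrix(corpus[i], control = list(dictionary = dictionary))
  })
  do.call(c, parts)
}

# Intended usage (assumes the objects from the question exist):
# movie_dtm_hiFq_train <- build_dtm_in_chunks(movie_corpus_train[1:60000], movie_dict)
```

This only packages the workaround; it does not explain why the single call fails.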