I use text2vec in R and am having trouble writing a stemming tokenizer function that works with the itoken function in the text2vec package. The text2vec documentation offers this tokenizer:
stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>% lapply(SnowballC::wordStem(language = 'en'))
}
However, this tokenizer does not work. This is the code I used (borrowed from previous Stack Overflow answers):
library(text2vec)
library(data.table)
library(SnowballC)

data("movie_review")
train_rows = 1:1000
prepr = tolower
stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>% lapply(SnowballC::wordStem(language = 'en'))
}
tok = stem_tokenizer1
it <- itoken(movie_review$review[train_rows], prepr, tok,
             ids = movie_review$id[train_rows])
This is the error it produces:
Error in { : argument "words" is missing, with no default
I believe the problem is that wordStem needs a character vector, but word_tokenizer creates a list of character vectors.
mr <- movie_review$review[1]
stem_mr1 <- stem_tokenizer1(mr)
Error in SnowballC::wordStem(language = "en") : argument "words" is missing, with no default
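To show the type mismatch I mean, here is a small standalone check (the sample strings are made up, not from movie_review):

# word_tokenizer() returns a list with one character vector per document
str(word_tokenizer(c("the cats are running", "a second document")))

# wordStem() wants the words themselves as its first argument and
# returns a character vector of stems
SnowballC::wordStem(c("cats", "running"), language = 'en')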
To fix this problem, I wrote this function:
stem_tokenizer2 = function(x) {
  list(unlist(word_tokenizer(x)) %>% SnowballC::wordStem(language = 'en'))
}
However, this function does not work with the create_vocabulary function.
data("movie_review") train_rows = 1:1000 prepr = tolower stem_tokenizer2 = function(x) { list(unlist(word_tokenizer(x)) %>% SnowballC::wordStem(language='en') ) } tok = stem_tokenizer2 it <- itoken(movie_review$review[train_rows], prepr, tok, ids = movie_review$id[train_rows]) v <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5)
This runs without error, but the reported number of documents differs from the 1000 documents in the data, so I cannot create a document-term matrix or run LDA.
v$document_count
[1] 10
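For comparison, I would expect the same pipeline with the stock word_tokenizer (no stemming) to report all 1000 documents; a sketch of that check (not part of my original script):

it_plain <- itoken(movie_review$review[train_rows], prepr, word_tokenizer,
                   ids = movie_review$id[train_rows])
v_plain <- create_vocabulary(it_plain)
v_plain$document_count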
This code:
dtm_train <- create_dtm(it, vectorizer)
dtm_train
throws this error:
10 x 3809 sparse Matrix of class "dgCMatrix"
Error in validObject(x) : invalid class "dgCMatrix" object: length(Dimnames[1]) differs from Dim[1] which is 10
My questions are: is there something wrong with the function I wrote, and why does it produce this error with create_vocabulary? I suspect the problem is the output format of my function, but it looks identical to the output format of word_tokenizer, which works fine with itoken and create_vocabulary:
mr <- movie_review$review[1]
word_mr <- word_tokenizer(mr)
stem_mr <- stem_tokenizer2(mr)
str(word_mr)
str(stem_mr)
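For a single review the two structures look the same to me. The only difference I can spot appears when several documents are passed in at once; a made-up two-document example (not from movie_review), in case it is relevant:

docs <- c("first little document here", "and a second one")
str(word_tokenizer(docs))   # a list with one character vector per document
str(stem_tokenizer2(docs))  # everything collapsed into a single list element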