Stemming function for text2vec

I use text2vec in R and am having difficulty writing a stemming function that works with the itoken function in the text2vec package. The text2vec documentation suggests this stemming function:

stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>% lapply(SnowballC::wordStem(language = 'en'))
}

However, this function does not work. This is the code I used (borrowed from previous Stack Overflow answers):

library(text2vec)
library(data.table)
library(SnowballC)
data("movie_review")
train_rows = 1:1000
prepr = tolower
stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>% lapply(SnowballC::wordStem(language = 'en'))
}
tok = stem_tokenizer1
it <- itoken(movie_review$review[train_rows], prepr, tok,
             ids = movie_review$id[train_rows])

This is the error it produces:

Error in { : argument "words" is missing, with no default

I believe the problem is that wordStem needs a character vector, but word_tokenizer produces a list of character vectors. The error is reproducible on a single review:

mr <- movie_review$review[1]
stem_mr1 <- stem_tokenizer1(mr)

Error in SnowballC::wordStem(language = "en") : argument "words" is missing, with no default
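Indeed, wordStem is happy with a plain character vector; here is a minimal check (my own test, not part of the original post):

library(SnowballC)
# wordStem() stems each element of a character vector
wordStem(c("running", "walked"), language = "en")
# [1] "run"  "walk"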

To fix this problem, I wrote this function:

stem_tokenizer2 = function(x) {
  list(unlist(word_tokenizer(x)) %>% SnowballC::wordStem(language = 'en'))
}

However, this function does not work with the create_vocabulary function.

 data("movie_review") train_rows = 1:1000 prepr = tolower stem_tokenizer2 = function(x) { list(unlist(word_tokenizer(x)) %>% SnowballC::wordStem(language='en') ) } tok = stem_tokenizer2 it <- itoken(movie_review$review[train_rows], prepr, tok, ids = movie_review$id[train_rows]) v <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5) 

No error is raised, but the document count does not match the 1000 documents in the data, so I cannot create a document-term matrix or run LDA:

 v$document_count 

[1] 10
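A quick check of my own (not from the original post) shows what stem_tokenizer2 does to a batch of several documents: the whole batch is collapsed into a single list element, i.e. a single document:

stem_tokenizer2(c("first doc", "second doc"))
# [[1]]
# [1] "first"  "doc"    "second" "doc"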

This code:

dtm_train <- create_dtm(it, vectorizer)
dtm_train

produces this error:

10 x 3809 sparse Matrix of class "dgCMatrix"
Error in validObject(x) : invalid class "dgCMatrix" object:
  length(Dimnames[1]) differs from Dim[1] which is 10

My questions are: is there something wrong with the function I wrote, and why does it produce this error with create_vocabulary? I suspect it is a problem with my function's output format, but the output looks identical to that of word_tokenizer, which works fine with itoken and create_vocabulary:

mr <- movie_review$review[1]
word_mr <- word_tokenizer(mr)
stem_mr <- stem_tokenizer2(mr)
str(word_mr)
str(stem_mr)
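On a tiny input (my own illustration, not the movie review itself), the two shapes do indeed look identical:

str(word_tokenizer("running dogs"))
# List of 1
#  $ : chr [1:2] "running" "dogs"
str(stem_tokenizer2("running dogs"))
# List of 1
#  $ : chr [1:2] "run" "dog"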
1 answer

Thanks for using text2vec and reporting the problem. There is a mistake in the docs (can you tell me where you found this example, so that I can fix it?). The stem tokenizer should look like this:

stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>%
    lapply(function(x) SnowballC::wordStem(x, language = "en"))
}

The logic is as follows:

  • It takes a character vector and tokenizes it. The output is a list of character vectors (each list element is a character vector representing one document).
  • Then we apply stemming to each element of the list (wordStem can be applied to a character vector).

So the syntax error was in the lapply call in the example you used. It may be clearer if we rewrite it in plain R without the %>% operator, so that it looks like this:

stem_tokenizer1 = function(x) {
  tokens = word_tokenizer(x)
  lapply(tokens, SnowballC::wordStem, language = "en")
}
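As a quick sanity check (my illustration, not part of the original answer), the corrected tokenizer keeps one list element per document:

stem_tokenizer1(c("running dogs", "walked home"))
# [[1]]
# [1] "run" "dog"
#
# [[2]]
# [1] "walk" "home"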

I will also explain why you get 10 documents instead of 1000. By default, text2vec::itoken splits the data into 10 chunks (this can be adjusted in the itoken function) and processes it chunk by chunk. So when you apply unlist to each chunk, you actually recursively unlist 100 documents and create a single character vector per chunk.
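Putting the pieces together, here is a sketch of the full corrected pipeline (my assembly of the code above; note that the chunk-count argument is called n_chunks in recent text2vec releases and chunks_number in older ones, and depending on the version the document count is v$document_count or attr(v, "document_count")):

library(text2vec)
library(SnowballC)
data("movie_review")
train_rows = 1:1000

# stem each document's tokens, keeping one list element per document
stem_tokenizer1 = function(x) {
  tokens = word_tokenizer(x)
  lapply(tokens, SnowballC::wordStem, language = "en")
}

it <- itoken(movie_review$review[train_rows], tolower, stem_tokenizer1,
             ids = movie_review$id[train_rows])
v <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5)
v$document_count  # now 1000, one per document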


Source: https://habr.com/ru/post/1012524/

