Document-term matrix in R - bigram tokenizer does not work

I am trying to make two document-term matrices for a corpus, one with unigrams and one with bigrams. However, the bigram matrix is currently identical to the unigram matrix, and I'm not sure why.

Code:

library(tm)     # Corpus, DirSource, DocumentTermMatrix, stopwords
library(RWeka)  # NGramTokenizer, Weka_control

docs <- Corpus(DirSource("data", recursive = TRUE))

# Get the document term matrices
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize = "words",
                                                       removePunctuation = TRUE,
                                                       stopwords = stopwords("english"),
                                                       stemming = TRUE))
dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer,
                                                      removePunctuation = TRUE,
                                                      stopwords = stopwords("english"),
                                                      stemming = TRUE))

inspect(dtm_unigram)
inspect(dtm_bigram)

I also tried using ngram(x, n = 2) from the ngram package as the tokenizer, but that does not work either. How can I fix the bigram tokenization?

1 answer

The tokenize option doesn't seem to take effect with Corpus (which yields a SimpleCorpus here). Switching to VCorpus resolved the issue.
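A minimal sketch of that change, under a few assumptions: the two-sentence toy corpus built with VectorSource is hypothetical and stands in for the files under "data", and the tokenizer is built from ngrams() and words() (the pattern shown in the tm FAQ) rather than the RWeka tokenizer from the question, so that no Java setup is needed. The other control options (removePunctuation, stopwords, stemming) are left out to keep the example small.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Hypothetical toy corpus standing in for DirSource("data", recursive = TRUE).
texts <- c("the quick brown fox", "the lazy brown dog")

# VCorpus instead of Corpus, so that the custom tokenizer is applied.
docs <- VCorpus(VectorSource(texts))

# Bigram tokenizer: split into words, form 2-grams, join each pair with a space.
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))

# Every term should now be a two-word phrase.
print(Terms(dtm_bigram))
```

With the real data, the same idea would be docs <- VCorpus(DirSource("data", recursive = TRUE)), and the RWeka-based BigramTokenizer from the question should be usable in place of the one above.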


Source: https://habr.com/ru/post/1265368/

