Document-term matrix in R - bigram tokenizer does not work

Question

I am trying to build two document-term matrices for a corpus, one with unigrams and one with bigrams. However, the bigram matrix currently comes out identical to the unigram matrix, and I'm not sure why.

Code:

    library(tm)
    library(RWeka)

    docs <- Corpus(DirSource("data", recursive = TRUE))

    # Get the document term matrices
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize = "words",
                                                           removePunctuation = TRUE,
                                                           stopwords = stopwords("english"),
                                                           stemming = TRUE))
    dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer,
                                                          removePunctuation = TRUE,
                                                          stopwords = stopwords("english"),
                                                          stemming = TRUE))

    inspect(dtm_unigram)
    inspect(dtm_bigram)

I also tried using ngram(x, n = 2) from the ngram package as the tokenizer, but that does not work either. How can I fix the bigram tokenization?

r tokenize tm n-gram rweka

filaments Mar 05 '17 at 4:11

1 answer

filaments · Accepted Answer · 2017-03-28T18:30:48+0000

The tokenize option doesn't seem to take effect when the corpus is created with Corpus(), which returns a SimpleCorpus here. Creating the corpus with VCorpus() instead resolved the issue.
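
As a minimal sketch of that change, the snippet below builds the corpus with VCorpus() and passes the same BigramTokenizer from the question. The two sample sentences and the use of VectorSource() are assumptions made so the example runs without a "data" directory. RWeka also needs a working Java installation.

```r
library(tm)     # VCorpus, VectorSource, DocumentTermMatrix, Terms
library(RWeka)  # NGramTokenizer, Weka_control

# Hypothetical sample texts standing in for the files under "data"
texts <- c("the quick brown fox", "the lazy dog sleeps")

# VCorpus() instead of Corpus(), so the custom tokenizer is applied
docs <- VCorpus(VectorSource(texts))

# Same bigram tokenizer as in the question
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))

# Each term should now be a two-word phrase rather than a single word
print(Terms(dtm_bigram))
```

For the setup in the question, the equivalent change would presumably be docs <- VCorpus(DirSource("data", recursive = TRUE)), with the rest of the code left as it is.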
