I am trying to build two document-term matrices for a corpus, one with unigrams and one with bigrams. However, the bigram matrix is currently identical to the unigram matrix, and I'm not sure why.
Code:
docs <- Corpus(DirSource("data", recursive = TRUE))

# Get the document term matrices
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize = "words",
                                                       removePunctuation = TRUE,
                                                       stopwords = stopwords("english"),
                                                       stemming = TRUE))

dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer,
                                                      removePunctuation = TRUE,
                                                      stopwords = stopwords("english"),
                                                      stemming = TRUE))

inspect(dtm_unigram)
inspect(dtm_bigram)
I also tried using ngram(x, n = 2) from the ngram package as the tokenizer, but that does not work either. How can I fix the bigram tokenization?
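One thing that may be worth checking, assuming tm 0.7 or later is installed: there, Corpus() can return a SimpleCorpus, and as far as I know DocumentTermMatrix() does not apply custom tokenizer functions to that class. Building the corpus with VCorpus() instead is a commonly suggested workaround. Below is a minimal sketch of that idea, not a confirmed fix for this exact setup. The toy `texts` vector is a made-up stand-in for the "data" directory, and the tokenizer is built from ngrams() and words() rather than RWeka so that it runs without Java.

```r
library(tm)  # tm depends on NLP, which provides ngrams() and words()

# Bigram tokenizer built from NLP helpers (no RWeka/Java needed)
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

# Hypothetical toy documents standing in for DirSource("data", recursive = TRUE)
texts <- c("the quick brown fox", "the lazy brown dog")

# VCorpus (not Corpus) so that the custom tokenizer is actually applied
docs <- VCorpus(VectorSource(texts))

dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))

# Every term should now be a two-word string
print(Terms(dtm_bigram))
```

With the original data, the equivalent change would be docs <- VCorpus(DirSource("data", recursive = TRUE)), keeping the rest of the control list as it is.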