The ngramLength argument of create_matrix doesn't seem to work. Here is a workaround:
    library(RTextTools)
    library(tm)
    library(RWeka)  # needed for NGramTokenizer

    texts <- c("This is the first document.",
               "Is this a text?",
               "This is the second file.",
               "This is the third text.",
               "File is not this.")

    # Tokenizer that emits only word trigrams (min = max = 3)
    TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

    # VCorpus rather than Corpus: in recent tm versions Corpus() may build a
    # SimpleCorpus, which ignores custom tokenizers
    dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)),
                              control = list(weighting = weightTf,
                                             tokenize = TrigramTokenizer))

    as.matrix(dtm)
This uses RWeka's NGramTokenizer instead of the tokenizer called by create_matrix. You can now use dtm in other RTextTools functions, for example to train the classification models below:
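    # Labels: TRUE for the documents that should count as "text"
    isText <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

    # First three documents are used for training, the last two for testing
    container <- create_container(dtm, isText, virgin = FALSE,
                                  trainSize = 1:3, testSize = 4:5)

    models <- train_models(container, algorithms = c("SVM", "BOOSTING"))
    classify_models(container, models)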
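If you want to sanity-check which terms the trigram tokenizer should be producing, here is a minimal base-R sketch that needs no extra packages. Note that word_ngrams is a hypothetical helper written for illustration, not part of tm, RWeka or RTextTools, and it only approximates Weka's delimiter handling (it lowercases the text and splits on any non-alphanumeric character), so treat its output as a rough guide to the column names of dtm rather than an exact match.

```r
# Hypothetical helper (not from tm/RWeka/RTextTools): word n-grams in base R.
word_ngrams <- function(text, n = 3) {
  # Lowercase and split on runs of non-alphanumeric characters
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]  # drop empty pieces, e.g. from leading punctuation

  # Fewer than n words means there are no n-grams to return
  if (length(words) < n) return(character(0))

  # Join each window of n consecutive words with a single space
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

word_ngrams("This is the first document.")
# "this is the" "is the first" "the first document"
```

A sentence with fewer than three words yields an empty character vector, so very short documents will end up as all-zero rows in a trigram matrix.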