I am trying to filter stop words from the following documents using the tm package.
    library(tm)
    documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
    corpus <- Corpus(VectorSource(documents))
    matrix <- DocumentTermMatrix(corpus, control = list(stopwords = TRUE))
However, when I run this code, I still get the following terms in the DocumentTermMatrix:
    colnames(matrix)
    [1] "brown" "dog" "fox" "jumps" "lazy" "over" "quick" "the" "walrus"
"the" is listed as a stop word in the list that the tm package uses, yet it still appears among the terms. Am I doing something wrong with the stopwords parameter, or is this a bug in the tm package?
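A quick way to confirm the premise is to look the word up in tm's own list. This is only a sanity check, and it assumes the English list is the one applied when `stopwords=TRUE` is used with an English-language corpus:

```r
library(tm)

# stopwords("english") returns tm's built-in English stop word list
# as a character vector.
english_stopwords <- stopwords("english")

# Check whether "the" is a member of that list.
the_is_stopword <- "the" %in% english_stopwords
print(the_is_stopword)
```

If this prints `TRUE`, the word is in the list and the problem lies in how the control options are applied, not in the list itself.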
EDIT: I contacted Ingo Feinerer, and he noted that technically this is not a bug:
First, user-provided options are processed, and then all remaining options. Hence stop word removal is done before tokenization (as already written by Vincent Zoonekynd on stackoverflow.com), which gives exactly your result.
Therefore, the solution is to explicitly specify the default tokenizer before the stopwords parameter, for example:
    library(tm)
    documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
    corpus <- Corpus(VectorSource(documents))
    matrix <- DocumentTermMatrix(corpus, control = list(tokenize = scan_tokenizer, stopwords = TRUE))
    colnames(matrix)
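An alternative that does not depend on the order of the control options is to remove stop words from the corpus itself before building the matrix. The following is a minimal sketch of that approach, not something suggested in the reply above; the variable names `cleaned` and `dtm` are chosen here for illustration, and the exact set of remaining terms may vary with the tm version.

```r
library(tm)

documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))

# Strip English stop words from the document text up front, so the
# result does not depend on how DocumentTermMatrix orders its options.
cleaned <- tm_map(corpus, removeWords, stopwords("english"))

# Build the matrix from the cleaned corpus with default control settings.
dtm <- DocumentTermMatrix(cleaned)
terms <- colnames(dtm)
print(terms)
```

Because the stop words are gone before tokenization ever happens, "the" should not show up among the terms regardless of which tokenizer is used.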