Package tm stop word parameter

Question

I am trying to filter stop words from the following documents using the tm package.

    library(tm)
    documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
    corpus <- Corpus(VectorSource(documents))
    matrix <- DocumentTermMatrix(corpus, control = list(stopwords = TRUE))

However, when I run this code, I still get the following terms in the DocumentTermMatrix:

    colnames(matrix)
    [1] "brown" "dog" "fox" "jumps" "lazy" "over" "quick" "the" "walrus"

"the" is listed as a stop word in the list that the tm package uses. Am I doing something wrong with respect to the stopwords parameter, or is this a bug in the tm package?

EDIT: I contacted Ingo Feinerer, and he noted that technically this is not a bug:

First, user-provided options are processed, and then all other options. Stop word removal is therefore performed before tokenization (as already noted by Vincent Zoonekynd on stackoverflow.com), which gives exactly your result.

So the solution is to list the default tokenizer explicitly before the stopwords parameter, for example:

    library(tm)
    documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
    corpus <- Corpus(VectorSource(documents))
    matrix <- DocumentTermMatrix(corpus, control = list(tokenize = scan_tokenizer, stopwords = TRUE))
    colnames(matrix)
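To see why the order of the control options matters, here is a toy model in base R. It is only a sketch of the behaviour described in the explanation above, not tm's actual code: the `steps` list, `run_pipeline`, the short `stop_words` vector and the default step order are all made up for illustration.

```r
# Toy model only: hypothetical step functions, not tm internals.
stop_words <- c("the", "over", "i", "am")  # short stand-in for a real stop word list

# Assumed default order: stop word filtering is listed before tokenization.
steps <- list(
  stopwords = function(x) x[!x %in% stop_words],               # drop exact matches
  tokenize  = function(x) unlist(strsplit(x, "[[:space:]]+"))  # split on whitespace
)

# Run the steps named in `control` first, in the order given,
# then any remaining steps in their default order.
run_pipeline <- function(text, control = list()) {
  step_order <- c(names(control), setdiff(names(steps), names(control)))
  for (name in step_order) text <- steps[[name]](text)
  text
}

doc <- "the quick brown fox jumps over the lazy dog"

only_stopwords <- run_pipeline(doc, control = list(stopwords = TRUE))
tokenize_first <- run_pipeline(doc, control = list(tokenize = TRUE, stopwords = TRUE))

print(only_stopwords)  # still contains "the": the filter saw one long string
print(tokenize_first)  # stop words removed, because tokens existed before filtering
```

In the first call the filter compares the whole sentence against the stop word list, finds no match, and removes nothing. Naming `tokenize` first in the control list reverses the order, which is the same idea as the fix above.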

+4

r nlp

Timothy P. Jurka Jan 26 '12 at 8:18

3 answers

You can also try removing the stop words from the corpus before creating the term matrix.

    library(tm)
    # text_corpus is assumed to be an existing tm corpus
    text_corpus <- tm_map(text_corpus, removeWords, stopwords("english"))
    dtm <- DocumentTermMatrix(text_corpus)
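This approach works regardless of the option order because the words are deleted from the raw text, before any tokenizer runs. Below is a rough base-R analogue of that idea. It assumes removal by a word-boundary regular expression; `remove_words` and the short `stop_words` vector are hypothetical, and tm's own removeWords may differ in detail.

```r
# Base-R analogue of deleting stop words from raw text (an assumption,
# not tm's actual code): match whole words with \\b and blank them out.
stop_words <- c("the", "over", "i", "am")  # hypothetical short list

remove_words <- function(text, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", text, perl = TRUE)
}

documents <- c("the quick brown fox jumps over the lazy dog",
               "i am the walrus")

cleaned <- remove_words(documents, stop_words)

# Tokenize afterwards; strsplit can leave empty strings where words
# were removed, so those are filtered out with nzchar().
tokens <- lapply(strsplit(cleaned, "[[:space:]]+"), function(x) x[nzchar(x)])
print(tokens)
```

The word boundaries keep the "i" in "quick" and the "am" in longer words intact, so only whole stop words disappear.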

This usually works for me.

+4

Shreyes May 19 '13 at 8:27

A quick fix is to drop the stop word columns after the matrix has been built:

    matrix <- matrix[, !colnames(matrix) %in% stopwords()]
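The same column filter can be tried on a plain base-R matrix, which makes it easy to check what is kept. This is only a sketch: `counts` is a made-up stand-in for a document-term matrix and `stop_words` is a hypothetical short list used in place of `stopwords()`.

```r
# Stand-in for a document-term matrix: 2 documents x 4 terms
# (values are filled column by column).
counts <- matrix(c(1, 0,
                   2, 1,
                   1, 0,
                   0, 1),
                 nrow = 2,
                 dimnames = list(c("doc1", "doc2"),
                                 c("fox", "the", "dog", "walrus")))

stop_words <- c("the", "over", "i", "am")  # hypothetical short list

# %in% binds tighter than !, so this reads as !(colnames(counts) %in% stop_words).
keep <- !colnames(counts) %in% stop_words

# drop = FALSE keeps the result a matrix even if a single column remains.
filtered <- counts[, keep, drop = FALSE]
print(colnames(filtered))
```

Note that this only hides the unwanted columns after the fact; the counts for the remaining terms are unchanged.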

+2

Sacha Epskamp Jan 26 '12 at 9:40

Vincent Zoonekynd · Accepted Answer · 2012-01-26T08:58:55+0000

This is a bug: you can report it to the author(s) of the package. The termFreq function applies various transformations to the texts, but not always in the right order. In your example, the code tries to remove the stop words before tokenization, that is, before the text is cut into words. It should happen afterwards, once we know what the words are.
