R tm stemCompletion generates NA value

When I apply stemCompletion to my corpus, the function generates NA values.

This is my code:

    my.corpus <- tm_map(my.corpus, removePunctuation)
    my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))

(one of the results: [[2584]] zoning plan)

The next step is to stem the corpus and then complete the stems:

    my.corpus <- tm_map(my.corpus, stemDocument, language="english")
    my.corpus <- tm_map(my.corpus, stemCompletion, dictionary=my.corpus_copy, type="first")

but the result of this is

[[2584]] NA plan

The next step is to build an incidence matrix of transactions and then mine apriori rules, but if I continue and try to get the rules, the inspect(rules) function gives me this error:

    > inspect(rules)
    Errore in UseMethod("inspect", x) :
      no applicable method for 'inspect' applied to an object of class "c('rules','associations')"

What is the problem? I believe the NA values prevent the incidence matrix, and therefore good rules, from being generated correctly. Is that the cause? If so, how can I solve it?

Here is a minimal example:

    my.words = c("β cell", "zoning policy regional index brazil", "zoning plan", "zolpidem adult", "zizyphus spinosa hu")
    my.corpus = Corpus(VectorSource(my.words))
    my.corpus_copy = my.corpus
    my.corpus = tm_map(my.corpus, removePunctuation)
    my.corpus = tm_map(my.corpus, removeWords, c("the", stopwords("english")))
    my.corpus = tm_map(my.corpus, stemDocument, language="english")
    my.corpus <- tm_map(my.corpus, stemCompletion, dictionary=my.corpus_copy, type="first")
    inspect(my.corpus)
1 answer

stemCompletion() is currently only an approximate reversal of the stemming process when the original corpus is used as the dictionary parameter. Using grep(), it searches the dictionary for all words that begin with the current stem, and then uses one of them as the completion, chosen according to type.

Thus, it fails whenever the stemming process returned stems that are not substrings of the unstemmed words. For example, the stems of c("delivery", "zoning") are c("deliveri", "zone"), as returned by wordStem(), which is used in stemDocument(). In both of these cases, the stems are not proper substrings of the unstemmed words. Therefore, stemCompletion() will not find a replacement and will return NA.
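To make the failure concrete, here is a minimal base-R sketch of the prefix lookup described above. It does not call tm or SnowballC: the stems are hard-coded from the example in this answer, and the small dictionary is made up for illustration.

```r
# Stems hard-coded from the example above (assumption: no stemmer is run here).
dictionary <- c("delivery", "zoning", "plan")
stems <- c("deliveri", "zone", "plan")

# Prefix lookup: collect every dictionary word that starts with the stem.
possible <- lapply(stems, function(w) {
  grep(sprintf("^%s", w), dictionary, value = TRUE)
})
names(possible) <- stems

# Taking the first candidate of an empty result yields NA.
first_match <- sapply(possible, "[", 1)
print(first_match)
```

Neither "deliveri" nor "zone" is a prefix of any dictionary word, so only "plan" gets a completion and the other two come back as NA.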

There are several ways to work around this, including replacing the NAs with dictionary words after stemCompletion() returns, or modifying the stemCompletion() function itself. A simple change that keeps the original stem instead of returning NA is to write your own version, stemCompletion_modified() (replace each ... with the source code of stemCompletion() from the tm package):

    stemCompletion_modified <- function (x, dictionary, type = ...) {
      ...
      # Original line:
      # possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
      # Modified: keep the stem itself when no dictionary word starts with it
      possibleCompletions <- lapply(x, function(w) {
        matches <- grep(sprintf("^%s", w), dictionary, value = TRUE)
        if (length(matches) == 0) w else matches
      })
      ...
    }
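The other alternative, patching the NAs after the fact, can be sketched in base R as well. The helper name fill_missing_completions is hypothetical, and for simplicity it falls back to the stem rather than to a dictionary word; the completed vector is a hand-written stand-in for what stemCompletion() returned in the question.

```r
# Hypothetical helper: replace each NA completion with the corresponding stem.
fill_missing_completions <- function(completed, stems) {
  stopifnot(length(completed) == length(stems))
  missing <- is.na(completed)
  completed[missing] <- stems[missing]
  completed
}

stems <- c("zone", "plan")
completed <- c(NA, "plan")  # stand-in for the stemCompletion() result, not real output
fixed <- fill_missing_completions(completed, stems)
print(fixed)
```

This leaves "zone" in place of the NA, so later steps such as building the incidence matrix see no missing terms, at the cost of keeping an uncompleted stem.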

Source: https://habr.com/ru/post/1276266/

