R tm replace words in Corpus using gsub

I have a large corpus with over 200 documents. As you would expect from a corpus of this size, some words are misspelled, appear in different forms, and so on. I have done the standard text processing, such as converting to lowercase, removing punctuation, and handling collocations. I am now trying to replace some words to correct their spelling and standardize them before moving on to the analysis. I have done over 100 substitutions using the same syntax as below, and most of them work as expected. However, some (about 5%) do not. For example, the following substitutions appear to have only a limited effect:

docs <- tm_map(docs, content_transformer(gsub), pattern = "medecin|medicil|medicin|medicinee", replacement = "medicine")
docs <- tm_map(docs, content_transformer(gsub), pattern = "eephant|eleph|elephabnt|elleph|elephanyt|elephantant|elephantant", replacement = "elephant")
docs <- tm_map(docs, content_transformer(gsub), pattern = "firehood|firewod|firewoo|firewoodloc|firewoog|firewoodd|firewoodd", replacement = "firewood") 

By limited effect I mean that some of the substitutions work while others do not. For example, despite my attempts to replace "elephantant", "medicinee", and "firewoodd", they still show up when I create a DTM (document-term matrix).

I do not know why this mixed effect occurs.

In addition, the following line replaces every word in the corpus with some combination of "collect":

docs <- tm_map(docs, content_transformer(gsub), pattern = "colect|colleci|collectin|collectiong|collectng|colllect|", replacement = "collect")

Just for reference, when I replace only one word, I use the syntax (note the fixed = TRUE):

docs <- tm_map(docs, content_transformer(gsub), pattern = "charcola", replacement = "charcoal", fixed=TRUE)

A single-word substitution that fails:

docs <- tm_map(docs, content_transformer(gsub), pattern = "dogmonkeycat", replacement = "dog monkey cat", fixed=TRUE)
1 answer

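Without word boundaries, each alternative also matches inside longer words, including correctly spelled ones: "medicin" matches the first seven letters of "medicine", so the word becomes "medicinee", and in the same way "eleph" turns "elephant" into "elephantant". An empty alternative, such as the one left by the trailing | in the "collect" pattern, matches the empty string, so the replacement gets inserted throughout the text.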
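
A minimal base-R sketch of the likely cause of the behaviour described in the question. It calls gsub directly on plain strings instead of going through tm_map; the variable names and sample inputs are made up for illustration, while the patterns are copied from the question.

```r
# Patterns copied from the question: no word boundaries, so each
# alternative can also match inside a longer word.
med_pattern <- "medecin|medicil|medicin|medicinee"
ele_pattern <- "eephant|eleph|elephabnt|elleph|elephanyt|elephantant|elephantant"

# "medicin" matches the first seven letters of the correctly spelled word,
# and the leftover "e" stays behind the replacement.
corrupted_med <- gsub(med_pattern, "medicine", "medicine")
print(corrupted_med)  # "medicinee"

# "eleph" matches the start of "elephant"; the leftover "ant" is kept.
corrupted_ele <- gsub(ele_pattern, "elephant", "elephant")
print(corrupted_ele)  # "elephantant"

# The trailing "|" adds an empty alternative, which matches the empty
# string, so "collect" is inserted into text that has nothing to do with it.
collect_pattern <- "colect|colleci|collectin|collectiong|collectng|colllect|"
with_empty <- gsub(collect_pattern, "collect", "wood")
print(with_empty)
```

This would also explain why "medicinee", "elephantant", and "firewoodd" keep appearing in the DTM: they are produced by the substitutions themselves each time a correctly spelled word is processed.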

To match these strings only as whole words (and not as parts of longer words), use word boundaries:

pattern = "\\b(medecin|medicil|medicin|medicinee)\\b"
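
As a rough check of the word-boundary pattern, the sketch below applies it with base-R gsub to a few sample strings. The sample vector and variable names are hypothetical; the same pattern string could be passed to tm_map with content_transformer(gsub) as in the question.

```r
# Word-boundary version of the pattern: an alternative only counts
# when it spans a whole word.
pattern <- "\\b(medecin|medicil|medicin|medicinee)\\b"

# Hypothetical sample input: one correct word and three misspellings.
docs_text <- c("medicine", "medicin", "medicinee", "the medecin man")

fixed_text <- gsub(pattern, "medicine", docs_text)
print(fixed_text)
# "medicine" "medicine" "medicine" "the medicine man"
```

The correctly spelled "medicine" is left untouched because there is no word boundary between "medicin" and the final "e".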

Or list the longer alternatives first:

pattern = "medicinee|medecin|medicil|medicin"
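
The effect of the ordering is easiest to see with an engine that tries alternatives from left to right and stops at the first one that matches. The sketch below assumes perl = TRUE for that reason, which the question does not use; the variable names are illustrative.

```r
misspelled <- "medicinee"

original_order <- "medecin|medicil|medicin|medicinee"
longest_first  <- "medicinee|medecin|medicil|medicin"

# With perl = TRUE the shorter "medicin" wins in the original order,
# so the trailing "ee" is left behind the replacement.
by_original <- gsub(original_order, "medicine", misspelled, perl = TRUE)
print(by_original)  # "medicineee"

# With the longest alternative first, the whole misspelling is consumed.
by_longest <- gsub(longest_first, "medicine", misspelled, perl = TRUE)
print(by_longest)  # "medicine"
```

Reordering alone does not protect correctly spelled words: "medicin" still matches inside "medicine", so this variant is best combined with word boundaries.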

Or, more compactly, use character classes (e.g. [ei]) and optional groups:

pattern = "med[ie]ci(?:n(?:ee)?|l)"
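
A small sketch of the compact pattern in use. Two things here are assumptions rather than part of the answer: the pattern is wrapped in \\b word boundaries so that correctly spelled words are left alone, and perl = TRUE is set so the non-capturing groups are handled by the PCRE engine. The sample words are taken from the alternatives in the question.

```r
# Compact pattern with word boundaries added (an assumption, see above).
pattern <- "\\bmed[ie]ci(?:n(?:ee)?|l)\\b"

# The four misspellings from the question plus the correct spelling.
words <- c("medecin", "medicil", "medicin", "medicinee", "medicine")

cleaned <- gsub(pattern, "medicine", words, perl = TRUE)
print(cleaned)
# "medicine" "medicine" "medicine" "medicine" "medicine"
```

A single pattern like this is shorter to maintain than a long alternation, at the cost of being harder to read when the misspellings have little in common.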

Source: https://habr.com/ru/post/1649246/

