I have a large corpus of over 200 documents. As you would expect from a corpus of this size, some words are misspelled, appear in different forms, and so on. I have applied standard text preprocessing such as lowercasing, punctuation removal, and collocation. I am now trying to replace some words to correct their spelling and standardize them before moving on to analysis. I have done over 100 substitutions using the same syntax as below, and most of them work as expected. However, some (about 5%) do not. For example, the following substitutions appear to have only a limited effect:
docs <- tm_map(docs, content_transformer(gsub), pattern = "medecin|medicil|medicin|medicinee", replacement = "medicine")
docs <- tm_map(docs, content_transformer(gsub), pattern = "eephant|eleph|elephabnt|elleph|elephanyt|elephantant|elephantant", replacement = "elephant")
docs <- tm_map(docs, content_transformer(gsub), pattern = "firehood|firewod|firewoo|firewoodloc|firewoog|firewoodd|firewoodd", replacement = "firewood")
By "limited effect" I mean that some of the variants are replaced while others are not. For example, despite trying to replace "elephantant", "medicinee", and "firewoodd", these tokens still exist when I create a DTM (document-term matrix).
I do not know why this mixed effect occurs.
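To take tm out of the picture, here is a minimal sketch that applies the same three patterns with base R's gsub to a few plain strings. The tokens are hypothetical test values, not taken from the real corpus. It suggests that a short alternative such as "medicin" may also match inside a word that is already spelled correctly, which would produce exactly the leftover tokens described above. The last call assumes that wrapping the alternatives in \\b word boundaries is an acceptable way to restrict the match to whole words.

```r
# Hypothetical test tokens: one misspelled and one correct form per word.
tokens <- c("medicin", "medicine", "eleph", "elephant", "firewoo", "firewood")

# Same patterns as in the tm_map calls above, applied to plain strings.
out <- gsub("medecin|medicil|medicin|medicinee", "medicine", tokens)
out <- gsub("eephant|eleph|elephabnt|elleph|elephanyt|elephantant|elephantant",
            "elephant", out)
out <- gsub("firehood|firewod|firewoo|firewoodloc|firewoog|firewoodd|firewoodd",
            "firewood", out)

# The correctly spelled inputs come back with the tail duplicated, because
# "medicin", "eleph" and "firewoo" also match inside the correct words.
print(data.frame(before = tokens, after = out))

# Assumption: \\b anchors limit each alternative to a whole word, so a
# correctly spelled "medicine" is left untouched.
bounded <- gsub("\\b(medecin|medicil|medicin|medicinee)\\b", "medicine",
                c("medicin", "medicine"))
print(bounded)
```

On these toy strings the unanchored patterns turn "medicine" into "medicinee", "elephant" into "elephantant", and "firewood" into "firewoodd", while the anchored version returns "medicine" for both inputs. Whether the same thing is happening inside the corpus is something I have not confirmed.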
Also, the following line replaces every word in the corpus with some combination of "collect":
docs <- tm_map(docs, content_transformer(gsub), pattern = "colect|colleci|collectin|collectiong|collectng|colllect|", replacement = "collect")
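A quick way to inspect this outside tm is to run the same pattern on a couple of plain strings. This is only a sketch with hypothetical inputs. Note that the pattern ends with a "|", so its last alternative is empty, and my assumption is that an empty alternative can match at every position in a string.

```r
# Same pattern as above, including the trailing "|" (an empty alternative).
pattern <- "colect|colleci|collectin|collectiong|collectng|colllect|"

# Hypothetical inputs: one unrelated word and one misspelling from the pattern.
words <- c("wood", "colect")
out <- gsub(pattern, "collect", words)
print(out)

# The same pattern without the trailing "|", for comparison.
trimmed <- sub("\\|$", "", pattern)
out_trimmed <- gsub(trimmed, "collect", words)
print(out_trimmed)
```

With the trailing "|" removed, the unrelated word "wood" is returned unchanged and "colect" becomes "collect".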
For reference, when I replace a single word I use the following syntax (note the fixed = TRUE):
docs <- tm_map(docs, content_transformer(gsub), pattern = "charcola", replacement = "charcoal", fixed=TRUE)
The only single-word substitution that fails is:
docs <- tm_map(docs, content_transformer(gsub), pattern = "dogmonkeycat", replacement = "dog monkey cat", fixed=TRUE)