First, regarding sed and awk: I have rarely needed them, since they feel fairly old-school. I usually write the equivalent regular expressions in Perl, which I find slightly more readable. Without getting into the merits of either implementation, when I'm not doing this in Perl I find that gsub, grep, and the related regex tools in R work quite well. Note that they accept perl = TRUE as an argument; I prefer Perl-style regex processing.
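For example, a minimal sketch of those base-R functions with perl = TRUE (the strings are made up for illustration):

```r
x <- c("apple pie", "Apple cider", "banana bread")

# grepl with an inline (?i) flag: case-insensitive matching via PCRE
grepl("(?i)apple", x, perl = TRUE)
#> [1]  TRUE  TRUE FALSE

# gsub with a Perl-style pattern: collapse runs of whitespace
gsub("\\s+", " ", "too   many    spaces", perl = TRUE)
#> [1] "too many spaces"

# regexpr + regmatches with a lookahead (needs perl = TRUE):
# pull out the word immediately before "pie"
m <- regexpr("\\w+(?=\\s+pie)", x, perl = TRUE)
regmatches(x, m)
#> [1] "apple"
```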
As for more serious packages, the tm package is particularly interesting. For more on natural language processing and text analysis resources, see the CRAN Task View for Natural Language Processing.
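As a rough sketch of the kind of pre-processing tm provides (the documents here are invented, and the package is assumed to be installed from CRAN):

```r
library(tm)

docs <- c("The cat sat on the mat.",
          "Dogs and cats are common pets.",
          "R makes text mining fairly approachable.")

corpus <- VCorpus(VectorSource(docs))

# Typical normalization steps shipped with tm
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)
```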
In addition, I think your question title conflates two concepts. Tools such as sed and awk, regular expressions, tokenization, and so on are important elements of text manipulation and pre-processing. Text processing, in the statistical sense, is more about modeling and depends on effective pre-processing and quantification of the text data. Beyond that, the two later stages of analysis you did not mention, information retrieval and natural language processing, are research and engineering areas in their own right, each with its own purposes. If you are primarily interested in manipulating text, then the various regex and pre-processing/normalization tools are enough. If you want to do predictive text processing, you will need to learn more statistical methods (a minimal quantification example is sketched below). NLP will require tools that do deeper analysis. All of this is accessible from within R, but the question is: how far down the rabbit hole do you want to go? Do you take the red pill?
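To make the "quantification" point concrete, here is a minimal sketch of turning text into numbers with tm's DocumentTermMatrix; everything past this point (classification, topic models, and so on) is where the more statistical machinery comes in. The example documents are again made up:

```r
library(tm)

docs <- c("the cat sat on the mat",
          "dogs and cats are common pets")

# A document-term matrix: rows are documents, columns are terms, cells are counts
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Overall term frequencies: the raw material for statistical / predictive work
sort(colSums(as.matrix(dtm)), decreasing = TRUE)
```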