What are some powerful tools for text processing and preprocessing in R?

I often use Hadley's stringr package to clean up messy environmental data (normalizing species names, fixing poorly formatted labels, etc.). I recently started learning sed and awk and was blown away by how effective these tools are, especially when dealing with multiple data files.

My questions:

  • Are there any other powerful text processing packages (beyond the base functions and those in stringr) that would be useful for data cleaning?

  • Is it possible to run sed commands / scripts from R? If so, how? Can you give me an example?

  • Has anyone tried to write a wrapper for sed as an R package? If not, would that be worthwhile (as a side project for me or for more competent programmers)?

1 answer

First, regarding sed and awk: I have rarely needed them, since they are rather old-school. I usually write regular expressions in Perl and get the same results with somewhat better readability. I don't want to debate the merits of each implementation, but when I am not writing such functions in Perl, I find that gsub, grep, and the related regex tools in R work quite well. Note that they accept perl = TRUE as an argument; I prefer Perl-style regex handling.
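To make the gsub / perl = TRUE point concrete, here is a minimal sketch of cleaning species names with base R only. The sample labels and the helper name normalize_species are made up for illustration; real data will need its own rules.

```r
# Hypothetical messy labels, invented for this example.
raw <- c("  quercus   ROBUR ", "Quercus robur L.", "QUERCUS_robur", "pinus sylvestris")

# Illustrative helper (not part of any package): normalize to "Genus species".
normalize_species <- function(x) {
  x <- gsub("[_\\s]+", " ", x, perl = TRUE)     # underscores and whitespace runs -> one space
  x <- gsub("^\\s+|\\s+$", "", x, perl = TRUE)  # trim leading and trailing spaces
  x <- tolower(x)                               # lower-case everything first
  x <- gsub("\\s+l\\.$", "", x, perl = TRUE)    # drop a trailing authority "L." (assumed rule)
  sub("^(\\w)", "\\U\\1", x, perl = TRUE)       # \U (Perl-only) upper-cases the first letter
}

cleaned <- normalize_species(raw)

# grepl with the same perl = TRUE switch can check the result's shape.
looks_binomial <- grepl("^[A-Z][a-z]+ [a-z]+$", cleaned, perl = TRUE)

print(unique(cleaned))
```

The \U case conversion in the replacement string only works with perl = TRUE, which is one practical reason to prefer that mode.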

As for more comprehensive packages, the tm package is especially interesting. For more on natural language processing and text analysis resources, see the CRAN Task View for Natural Language Processing.

In addition, I think your question title conflates two concepts. Tools such as sed and awk, regular expressions, tokenization, and so on are important elements of text processing and preprocessing. Text mining is more statistical and depends on effective preprocessing and quantification of the text data. Although you did not mention them, two further stages of analysis, information retrieval and natural language processing, are research and engineering fields with more specific goals. If you are primarily interested in manipulating text, the various tools for applying regular expressions and for preprocessing / normalization are enough. If you want to do text mining, you will need to learn more of the statistical functions. NLP requires tools that perform deeper analysis. All of these are accessible from within R, but the question is how far down this rabbit hole you want to go. Want to take the red pill?
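As a rough illustration of where preprocessing ends and quantification begins, the sketch below tokenizes two made-up sentences with regular expressions and then builds a small document-term count matrix in base R. The sample text and the tokenize helper are assumptions for the example; a package such as tm would normally handle this step.

```r
# Hypothetical documents, invented for this example.
docs <- c("Oak trees near the river.",
          "The river floods; oak seedlings die.")

# Illustrative helper: lower-case, strip non-letters, split on whitespace.
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z\\s]+", " ", x, perl = TRUE)  # punctuation and digits -> space
  strsplit(trimws(x), "\\s+", perl = TRUE)      # one character vector per document
}

tokens <- tokenize(docs)
vocab  <- sort(unique(unlist(tokens)))          # shared vocabulary across documents

# Count each vocabulary term per document: rows are documents, columns are terms.
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

print(dtm)
```

Everything up to tokens is plain text manipulation; the count matrix is the point where statistical methods can take over.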


Source: https://habr.com/ru/post/901403/

