Replace a set of pattern matches with the corresponding replacement strings in R

The str_replace (and preg_replace) function in PHP replaces all occurrences of the search string with the replacement string. What interests me most is that if the search and replace arguments are arrays (we call these vectors in R), then str_replace takes a value from each array (vector) in turn and uses the pair to search and replace on the subject.

In other words, does R (or some R package) have a function that performs the following:

    string <- "The quick brown fox jumped over the lazy dog."
    patterns <- c("quick", "brown", "fox")
    replacements <- c("slow", "black", "bear")
    xxx_replace_xxx(string, patterns, replacements)  ## ???
    ## [1] "The slow black bear jumped over the lazy dog."

So, I'm looking for something like chartr, but for search patterns and replacement strings with an arbitrary number of characters. This cannot be done with a single call to gsub(), since its replacement argument can only be a single string; see ?gsub. So my current implementation looks like this:

    xxx_replace_xxx <- function(string, patterns, replacements) {
      for (i in seq_along(patterns))
        string <- gsub(patterns[i], replacements[i], string, fixed=TRUE)
      string
    }

However, I am looking for something much faster when length(patterns) is large. I have a lot of data to process and I am not satisfied with the current performance.

Toy example data for benchmarking:

    string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
    patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
                  "po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy",
                  "czy", "sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski",
                  "jeszcze")
    replacements <- paste0(patterns, rev(patterns))
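A side note on the loop in the question: because each gsub() call runs on the output of the previous one, a later pattern can match text that an earlier replacement just inserted. That matters for the benchmark data too, since every replacement contains its own pattern and short patterns such as "z" occur inside other replacements. The sketch below illustrates the effect on a tiny input; `replace_sequential` is a hypothetical name for the same loop, not a function from any package.

```r
# Same sequential loop as in the question, under a hypothetical name.
replace_sequential <- function(string, patterns, replacements) {
  for (i in seq_along(patterns))
    string <- gsub(patterns[i], replacements[i], string, fixed = TRUE)
  string
}

# Step 1 turns "cat dog" into "dog dog"; step 2 then replaces BOTH "dog"s.
cascaded <- replace_sequential("cat dog", c("cat", "dog"), c("dog", "bird"))
print(cascaded)
## [1] "bird bird"
```

Whether this cascading is wanted depends on the use case; PHP's str_replace with arrays behaves the same way, so it at least matches the function the question uses as a model.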
3 answers

Using PCRE instead of fixed matching takes about one third of the time on my computer for your example.

    xxx_replace_xxx_pcre <- function(string, patterns, replacements) {
      for (i in seq_along(patterns))
        string <- gsub(patterns[i], replacements[i], string, perl=TRUE)
      string
    }

    system.time(x <- xxx_replace_xxx(string, patterns, replacements))
    #    user  system elapsed
    #   0.491   0.000   0.491
    system.time(p <- xxx_replace_xxx_pcre(string, patterns, replacements))
    #    user  system elapsed
    #   0.162   0.000   0.162
    identical(x, p)
    # [1] TRUE
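One caveat: with perl=TRUE the patterns are interpreted as regular expressions rather than literal strings, so the two versions agree only as long as the patterns contain no regex metacharacters (true for the example data). If the patterns may contain characters such as "." or "(", they would need escaping first. A minimal sketch, assuming a hypothetical helper `escape_regex` (not part of base R):

```r
# Hypothetical helper: backslash-escape PCRE metacharacters so that a
# literal string can be used safely as a pattern with perl = TRUE.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

unescaped <- gsub("a.b", "X", "a.b axb", perl = TRUE)
escaped   <- gsub(escape_regex("a.b"), "X", "a.b axb", perl = TRUE)
print(unescaped)  # "." matches any character, so both words are replaced
## [1] "X X"
print(escaped)    # only the literal "a.b" is replaced
## [1] "X axb"
```

The escaping itself costs one extra gsub() call over the pattern vector, which is negligible next to the replacement loop over the data.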

If the patterns are fixed strings of word characters, as in the example, then this works. gsubfn is similar to gsub, except that the replacement argument can be a string, list, function, or proto object. If it is a list, as here, gsubfn compares each match of the regular expression against the list's names and, for those that are found, replaces them with the corresponding values:

    library(gsubfn)
    gsubfn("\\b\\w+\\b", as.list(setNames(replacements, patterns)), string)
    ## [1] "The slow black bear jumped over the lazy dog."
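For comparison, the same single-pass idea (find every word once, then look it up in a named vector) can be sketched in base R with gregexpr() and `regmatches<-`. This is only an illustration of the technique, not how gsubfn is implemented, and `replace_by_lookup` is a hypothetical name.

```r
# Hypothetical base-R sketch: match whole words once, then replace each
# matched word that appears in `patterns` with its paired replacement.
replace_by_lookup <- function(string, patterns, replacements) {
  lookup <- setNames(replacements, patterns)
  m <- gregexpr("\\b\\w+\\b", string, perl = TRUE)
  regmatches(string, m) <- lapply(regmatches(string, m), function(words) {
    hit <- words %in% patterns
    words[hit] <- unname(lookup[words[hit]])
    words
  })
  string
}

res <- replace_by_lookup("The quick brown fox jumped over the lazy dog.",
                         c("quick", "brown", "fox"),
                         c("slow", "black", "bear"))
print(res)
## [1] "The slow black bear jumped over the lazy dog."
```

Because each word is looked up exactly once, the number of regex passes over the data does not grow with length(patterns); the trade-off is that only whole-word matches are replaced.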

This can be done with stringi >= 0.3-1, using one of the stri_replace_all_* functions with the argument vectorize_all set to FALSE:

    library("stringi")
    string <- "The quicker brown fox jumped over the lazy dog."
    patterns <- c("quick", "brown", "fox")
    replacements <- c("slow", "black", "bear")
    stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE)
    ## [1] "The slower black bear jumped over the lazy dog."
    stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
    ## [1] "The quicker black bear jumped over the lazy dog."

Some benchmarks:

    string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
    patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
                  "po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy",
                  "czy", "sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski",
                  "jeszcze")
    replacements <- paste0(patterns, rev(patterns))

    microbenchmark::microbenchmark(
      stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE),
      stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE),
      xxx_replace_xxx_pcre(string, "\\b" %s+% patterns %s+% "\\b", replacements),
      gsubfn("\\b\\w+\\b", as.list(setNames(replacements, patterns)), string),
      unit="relative",
      times=25
    )
    ## Unit: relative
    ##                    expr       min        lq      mean    median        uq       max neval
    ##  stri_replace_all_fixed  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    25
    ##  stri_replace_all_regex  2.169701  2.248115  2.198638  2.267935  2.267635  1.753289    25
    ##    xxx_replace_xxx_pcre  1.983135  1.967303  1.937021  1.961449  1.974422  1.469894    25
    ##                  gsubfn 63.067835 69.870657 69.815031 71.178841 72.503020 57.019072    25

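So, among the approaches that match only at word boundaries, the PCRE-based version is the fastest. stri_replace_all_fixed is roughly twice as fast again, but it replaces every occurrence, including those inside longer words.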


Source: https://habr.com/ru/post/1205958/

