R: quantifying agrep results

Is there a built-in way to quantify the results of the agrep function? For example, in

agrep("test", c("tesr", "teqr", "toar"), max.distance = 2, value = TRUE)
[1] "tesr" "teqr"

tesr is only 1 character edit away from test, while teqr is 2 edits away and toar is 3, so toar is not found. It seems that tesr should have a higher "probability" than teqr. How can this be retrieved, either as a number of edits or as a percentage? Thanks!

Edit: Sorry for not mentioning this in the first place. I am already performing a two-step procedure: agrep to get my list, and then adist to get the number of edits. adist runs slower, and runtime is a big factor with my dataset.

2 answers

The Levenshtein distance is the number of edits needed to turn one string into another. The RecordLinkage package may be of interest. It provides an edit distance calculation that should perform on a par with agrep, although it will not return exactly the same results as agrep.

library(RecordLinkage)
ld <- levenshteinDist("test", c("tesr", "teqr", "toar"))
c("tesr", "teqr", "toar")[which(ld < 3)]

Another option with adist():

s <- c("tesr", "teqr", "toar")
s[adist("test", s) < 3]

Or using stringdist

library(stringdist)
s[stringdist("test", s, method = "lv") < 3]

Which gives:

#[1] "tesr" "teqr"
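
To get the number of edits or a percentage for each match, rather than only filtering on them, the distances can be kept instead of thrown away. Below is a minimal sketch in base R. It assumes a similarity defined as 1 minus the distance divided by the length of the longer string; that normalization is one common choice, not something agrep reports itself, and the variable names are only illustrative.

```r
pattern <- "test"
s <- c("tesr", "teqr", "toar")

# Step 1: approximate filter, keeping strings within at most 2 edits
hits <- agrep(pattern, s, max.distance = 2, value = TRUE)

# Step 2: generalized Levenshtein distance, computed only for the survivors
d <- as.vector(adist(pattern, hits))

# Assumed normalization: 1 - distance / length of the longer string
sim <- 1 - d / pmax(nchar(pattern), nchar(hits))

data.frame(match = hits, edits = d, similarity = sim)
```

For the example strings this should report 1 edit (similarity 0.75) for tesr and 2 edits (similarity 0.50) for teqr. If I recall correctly, the stringdist package also offers stringsim(), which returns a similarity between 0 and 1 directly.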

Benchmark

x <- rep(s, 10e5)        # 3 million strings in total
library(RecordLinkage)   # provides levenshteinDist()
library(microbenchmark)
mbm <- microbenchmark(
  levenshteinDist = x[which(levenshteinDist("test", x) < 3)],
  adist = x[adist("test", x) < 3],
  stringdist = x[stringdist("test", x, method = "lv") < 3],
  times = 10
)

Which gives:

Unit: milliseconds
            expr       min        lq      mean    median        uq       max neval cld
 levenshteinDist  840.7897 1255.1183 1406.8887 1398.4502 1510.5398 1960.4730    10  b 
           adist 2760.7677 2905.5958 2993.9021 2986.1997 3038.7692 3472.7767    10   c
      stringdist  145.8252  155.3228  210.4206  174.5924  294.8686  355.1552    10 a  

Source: https://habr.com/ru/post/1613211/

