agrep() unexpected results related to max.distance in R

EDIT: This bug was found in 32-bit versions of R; it was fixed in R version 2.9.2.


This was sent to me by @leoniedu today, and I don't have an answer for it, so I thought I'd post it here.

I read the documentation for agrep() (fuzzy string matching), and it seems I don't fully understand the max.distance parameter. Here is an example:

pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"
agrep(pattern,x,max.distance=18) 
agrep(pattern,x,max.distance=19)

This behaves exactly as I expected. The two strings differ by 18 characters, so I would expect that to be the threshold for a match. Here's what bothers me:
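For what it's worth, the figure of 18 can be checked directly. This is a minimal sketch, assuming a more recent R than the 2.9.x discussed here, where adist() is available in the utils package:

```r
# Sketch only: adist() comes from the utils package in recent R versions.
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"

# Generalized Levenshtein (edit) distance between the two full strings
d <- utils::adist(pattern, x)[1, 1]
d                            # 18

# Same number: x is a suffix of pattern, so only deletions are needed
nchar(pattern) - nchar(x)    # 18
```

Since x is a suffix of pattern, the distance is just the difference in lengths: the 18 leading characters that have to be deleted.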

agrep(pattern,x,max.distance=30) 
agrep(pattern,x,max.distance=31)
agrep(pattern,x,max.distance=32) 
agrep(pattern,x,max.distance=33)

Why do 30 and 33 match, but not 31 and 32? To save you some counting,

> nchar("Staatssekretar im Bundeskanzleramt")
[1] 34
> nchar("Bundeskanzleramt")
[1] 16
2 answers

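I posted this to the R list a while back and reported it as a bug on R-bugs. I got no useful responses, so I tweeted about it to see whether the bug was reproducible or whether I was just missing something. JD Long was able to reproduce it and kindly posted the question here.

Note that, at least in R, agrep is something of a misnomer, since it does not match regular expressions, while grep stands for "globally search for the regular expression and print". It should not have a problem with patterns longer than the target string. (I think!)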

On my Linux server everything works as expected, but not on my Mac and Windows machines.

Mac:

> sessionInfo()
R version 2.9.1 (2009-06-26)
i386-apple-darwin8.11.1

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

> agrep(pattern, x, max.distance = 30)
[1] 1
> agrep(pattern, x, max.distance = 31)
integer(0)
> agrep(pattern, x, max.distance = 32)
integer(0)
> agrep(pattern, x, max.distance = 33)
[1] 1

Linux:

R version 2.9.1 (2009-06-26)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

> agrep(pattern, x, max.distance = 30)
[1] 1
> agrep(pattern, x, max.distance = 31)
[1] 1
> agrep(pattern, x, max.distance = 32)
[1] 1
> agrep(pattern, x, max.distance = 33)
[1] 1
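Given that the wrong results show up on the i386 build above but not on the x86_64 one, a quick way to see which kind of build a session is running might look like this. It is only a sketch: the variable names are made up for illustration, and the 2.9.2 cutoff is taken from the EDIT note in the question.

```r
# Sketch: is this session a 32-bit build older than R 2.9.2?
# All variable names here are illustrative, not part of any R API.
is_32bit <- .Machine$sizeof.pointer == 4   # 4-byte pointers mean a 32-bit build
is_old   <- getRversion() < "2.9.2"        # release said to contain the fix

possibly_affected <- is_32bit && is_old
possibly_affected
```

A TRUE here only means the session matches the conditions reported in this thread; it does not prove that agrep() misbehaves.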


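I'm not sure your example makes sense. For the basic grep(), pattern is usually a simple string or a regular expression, and x is a vector whose elements get matched against pattern. Having a pattern that is a longer string than x strikes me as odd.

Consider the difference from just using grep for a substring match: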

R> grep("vo", c("foo","bar","baz"))   # vo is not in the vector
integer(0)
R> agrep("vo", c("foo","bar","baz"), value=TRUE) # but is close enough to foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.25) # still foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.75) # now all match
[1] "foo" "bar" "baz"
R>  
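To see why the last two calls differ, it may help to turn the fractions into edit counts. The agrep documentation describes a fractional max.distance as a fraction of the pattern length, replaced by the smallest integer not less than that fraction. The sketch below assumes unit costs for insertions, deletions and substitutions; max_edits is a hypothetical helper written for this illustration, not an R function.

```r
# Hypothetical helper (not part of R): number of edits allowed when
# max.distance is given as a fraction, assuming every edit costs 1.
max_edits <- function(pattern, frac) ceiling(frac * nchar(pattern))

max_edits("vo", 0.10)   # 1 (0.1 is the default max.distance)
max_edits("vo", 0.25)   # 1
max_edits("vo", 0.75)   # 2

# Approximate substring distance from "vo" to each element
# (adist() is only available in recent versions of R)
utils::adist("vo", c("foo", "bar", "baz"), partial = TRUE)
```

With one edit allowed, "vo" can only be turned into a substring of "foo". With two edits allowed on a two-character pattern, every string matches, which is consistent with the last call returning all three elements.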

Source: https://habr.com/ru/post/1713542/