Order-independent fuzzy match "First Name Last Name" / "Last Name First Name" in R?

Question

I have two lists of names for the same set of students, compiled separately. There are many typographical errors, so I use fuzzy matching to link the two lists. I am 99+% of the way there with agrep and similar functions, but I am stuck on one main problem: how can I match names such as "Adrian Bruce" and "Bruce Adrian"? Levenshtein edit distance does not suit this case, since it counts single-character insertions, deletions and substitutions, so a swapped word order looks like a large distance.

This must be a very common problem, but I cannot find any standard R package or procedure for handling it. I suppose I am missing something obvious?

+4

string-matching r pattern-matching fuzzy

Jonathan burley Feb 02 '12 at 18:42

2 answers

Tommy · Answer 1 · 2012-02-02T20:07:37+0000

OK, one easy way is to swap the two words, recombine them, and match against the swapped strings as well:

 y <- c("Bruce Almighty", "Lee, Bruce", "Leroy Brown")
 y2 <- sub("(.*) (.*)", "\\2 \\1", y)  # swap the two words
 agrep("Bruce Lee", y)   # No match
 agrep("Bruce Lee", y2)  # Match!
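To avoid running the two searches by hand, the indices from both calls can be combined. The following is only a sketch building on the snippet above: the names `roster`, `swapped` and `hits` are made up for illustration, and it assumes each name has exactly two words (the greedy pattern splits at the last space).

```r
# Hypothetical example data; same strings as in the snippet above.
roster  <- c("Bruce Almighty", "Lee, Bruce", "Leroy Brown")

# Swap the part before the last space with the part after it.
swapped <- sub("(.*) (.*)", "\\2 \\1", roster)

# Search both orderings and merge the matching indices.
hits <- union(agrep("Bruce Lee", roster), agrep("Bruce Lee", swapped))

roster[hits]
```

With agrep's default `max.distance`, this should return only "Lee, Bruce". Names with middle names or more than two words would need a different splitting rule.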

Mattia landoni · Answer 2 · 2017-06-24T03:36:06+0000

The technique that I usually use is fairly robust and relatively insensitive to word order, punctuation, etc. It is based on "n-grams": overlapping runs of n consecutive characters. With n = 2 they are called "bigrams". For instance:

 "Adrian Bruce" --> ("Ad","dr","ri","ia","an","n "," B","Br","ru","uc","ce")
 "Bruce Adrian" --> ("Br","ru","uc","ce","e "," A","Ad","dr","ri","ia","an")

Each string has 11 bigrams, and 9 of them are common to both. Thus the similarity score is very high: 9/11, or 0.818, where 1.000 is a perfect match.

I am not very familiar with R, but if no package exists for this, the method is very easy to code yourself. You can write code that goes through the bigrams of string 1 and counts how many of them also occur in string 2.
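A minimal base-R sketch of that counting approach could look like the following. The function names `bigrams` and `bigram_similarity` are hypothetical, not from any package, and the score is defined exactly as described above: the share of string 1's bigrams that also appear in string 2.

```r
# Split a string into its overlapping two-character pieces.
bigrams <- function(s) {
  n <- nchar(s)
  if (n < 2) return(character(0))
  substring(s, 1:(n - 1), 2:n)
}

# Fraction of the bigrams of `a` that also occur in `b`.
bigram_similarity <- function(a, b) {
  ga <- bigrams(a)
  gb <- bigrams(b)
  if (length(ga) == 0) return(0)
  sum(ga %in% gb) / length(ga)
}

bigram_similarity("Adrian Bruce", "Bruce Adrian")  # 9/11, about 0.818
```

Note that this score is not symmetric when the two strings differ in length, and it is case-sensitive; applying `tolower()` to both inputs first is one option. The stringdist package also offers q-gram based measures (for example `stringsim()` with `method = "jaccard"` and `q = 2`), though its scores are defined differently from the simple ratio used here.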
