Order-independent fuzzy match "First Name Last Name" / "Last Name First Name" in R?

I have two lists of names for the same set of students, which were compiled separately. There are many typographical errors, and I have been using fuzzy matching to link the two lists. I am 99+% of the way there with agrep and similar functions, but I am stuck on one main problem: how can I match (for example) the names "Adrian Bruce" and "Bruce Adrian"? Levenshtein edit distance is no good for this particular case, since it counts character-level edits, so two names that differ only in word order come out far apart.

This should be a very common problem, but I cannot find any standard R package or procedure for handling it. I suppose I'm missing something obvious ... ???

+4
2 answers

OK, one easy way is to swap the two words and match again:

    y <- c("Bruce Almighty", "Lee, Bruce", "Leroy Brown")
    y2 <- sub("(.*) (.*)", "\\2 \\1", y)  # swap the two words
    agrep("Bruce Lee", y)   # No match
    agrep("Bruce Lee", y2)  # Match!
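The swap only covers names with exactly two words. As a rough sketch of one way to generalize it (not something the answer itself proposes), both lists can be reduced to a canonical key before matching: lower-case, punctuation removed, words sorted alphabetically. The helper name `normalize_name` is made up for this illustration.

```r
# Hypothetical helper: build an order-independent key for each name by
# lower-casing, replacing punctuation with spaces, and sorting the words.
normalize_name <- function(x) {
  x <- tolower(gsub("[[:punct:]]", " ", x))
  words <- strsplit(trimws(x), "[[:space:]]+")
  vapply(words, function(w) paste(sort(w), collapse = " "), character(1))
}

y <- c("Bruce Almighty", "Lee, Bruce", "Leroy Brown")
key <- normalize_name(y)   # "almighty bruce" "bruce lee" "brown leroy"

# Fuzzy-match the normalized query against the normalized keys.
hits <- agrep(normalize_name("Bruce Lee"), key)
y[hits]
```

Because both sides go through the same normalization, "Lee, Bruce" and "Bruce Lee" end up with the same key, and agrep still tolerates small typos within the words.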
+3

The technique I usually use is fairly robust and relatively insensitive to word order, punctuation, etc. It is based on "n-grams": overlapping substrings of n characters. With n = 2 they are called "bigrams". For instance:

    "Adrian Bruce" --> ("Ad","dr","ri","ia","an","n "," B","Br","ru","uc","ce")
    "Bruce Adrian" --> ("Br","ru","uc","ce","e "," A","Ad","dr","ri","ia","an")

Each string has 11 bigrams, and 9 of them are common to both. Thus the similarity score is very high: 9/11, or 0.818, where 1.000 is a perfect match.

I am not very familiar with R, but if no package for this exists, the method is very easy to code. You can write code that goes through the bigrams of string 1 and counts how many of them also occur in string 2.
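A minimal base-R sketch of that counting procedure follows. The function names `ngrams` and `ngram_similarity` are made up for this illustration, and the score is simply the share of string 1's bigrams found in string 2, as described above. That makes it asymmetric when the two strings differ in length.

```r
# Hypothetical helper: all overlapping character n-grams of a single string.
ngrams <- function(s, n = 2) {
  if (nchar(s) < n) return(character(0))
  starts <- seq_len(nchar(s) - n + 1)
  substring(s, starts, starts + n - 1)
}

# Share of the n-grams of `a` that also occur somewhere in `b`.
ngram_similarity <- function(a, b, n = 2) {
  ga <- ngrams(a, n)
  gb <- ngrams(b, n)
  if (length(ga) == 0) return(0)
  sum(ga %in% gb) / length(ga)
}

# 9 of the 11 bigrams are shared, so the score is 9/11.
ngram_similarity("Adrian Bruce", "Bruce Adrian")
```

The comparison is case-sensitive as written; passing both strings through tolower() first would make it more forgiving. If an existing package is preferred, the CRAN package stringdist provides q-gram based distances (for example `method = "qgram"` or `method = "jaccard"` with a `q` argument), which may be worth checking before hand-rolling this.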

0

Source: https://habr.com/ru/post/1394455/

