Best practice for determining the probability of matching two strings

Question

Best practice for determining the probability of matching two strings

I need to write code to determine if two lines match when one of the lines may contain a slight deviation from the second line, for example. “South Africa” v “South Africa” or “England” v “Enlgand”. I am currently considering the following approach

Determine the percentage of characters in line 1 that match the values in line 2
Determine the true probability of a match by combining result 1 with a comparison of the length of two lines, for example. although all characters in “SA” are in “South Africa”, this is not a very likely match, since “SA” could be found in other country names.

I would appreciate the best practice for doing this string matching.

+4

ios objective-c iphone cocoa-touch

Runloop Feb 22 '10 at 20:47

source share

5 answers

To create the perfect fuzzy string option, it's important to know about the context of the strings. When it comes to small typos, Levenshtein can be pretty good. When it comes to incorrect sound, you can use a phonetic algorithm such as soundex or metaphone. In most cases, you need a combination of the following algorithms and some more specific hand-written files.

Needleman-wunsch
Soundex
Metaphone
Levenshtein distance
Bitmap image
Hamming Distance

There is no better algorithm for comparing fuzzy strings. It's all about the context in which it is used, so you need to tell us about where you want to use string matching.

+9

Paco Feb 22 '10 at 21:00

source share

Do not reinvent the wheel. Wikipedia has a Levenshtein algorithm that has metrics for what you want to do.

http://en.wikipedia.org/wiki/Levenshtein_distance

There is also Soundex, but this may be too simplistic for your requirements.

+3

stillstanding Feb 22 '10 at 21:02

source share

Using Soundex turned out to be nice for me: With a little tweaking or two implementations, Soundex can check cross languages if two strings of different languages sound the same ..

Objective-C Implementation of Soundex: http://www.cocoadev.com/index.pl?NSStringSoundex

0

Rabindev May 26, '11 at 14:19

source share

I found an implementation of the Objective-C Levenshtein distance algorithm here . This works great for me.

0

vomako May 10, '15 at 10:13

source share

Dawid fatyga · Accepted Answer · 2010-02-22T20:51:54+0000

You can see the Levenshtein distance . This is the distance between two lines. The same lines have a distance of 0. Lines, such as a kitten and a centimeter, have a distance of 1, etc. Distance is measured by the minimum number of simple operations that convert one line to another.

Additional information and the algorithm in the pseudo-code are given in the link.

I also remember that this topic was mentioned in Game Stones of Programming: Volume 6 : Article 1.6. Nearest-String Matching Algorithm

Best practice for determining the probability of matching two strings

More articles: