Best practice for determining the probability of matching two strings

I need to write code to determine if two lines match when one of the lines may contain a slight deviation from the second line, for example. “South Africa” v “South Africa” or “England” v “Enlgand”. I am currently considering the following approach

  • Determine the percentage of characters in line 1 that match the values ​​in line 2
  • Determine the true probability of a match by combining result 1 with a comparison of the length of two lines, for example. although all characters in “SA” are in “South Africa”, this is not a very likely match, since “SA” could be found in other country names.

I would appreciate the best practice for doing this string matching.

+4
source share
5 answers

You can see the Levenshtein distance . This is the distance between two lines. The same lines have a distance of 0. Lines, such as a kitten and a centimeter, have a distance of 1, etc. Distance is measured by the minimum number of simple operations that convert one line to another.

Additional information and the algorithm in the pseudo-code are given in the link.

I also remember that this topic was mentioned in Game Stones of Programming: Volume 6 : Article 1.6. Nearest-String Matching Algorithm

+12
source

To create the perfect fuzzy string option, it's important to know about the context of the strings. When it comes to small typos, Levenshtein can be pretty good. When it comes to incorrect sound, you can use a phonetic algorithm such as soundex or metaphone. In most cases, you need a combination of the following algorithms and some more specific hand-written files.

  • Needleman-wunsch
  • Soundex
  • Metaphone
  • Levenshtein distance
  • Bitmap image
  • Hamming Distance

There is no better algorithm for comparing fuzzy strings. It's all about the context in which it is used, so you need to tell us about where you want to use string matching.

+9
source

Do not reinvent the wheel. Wikipedia has a Levenshtein algorithm that has metrics for what you want to do.

http://en.wikipedia.org/wiki/Levenshtein_distance

There is also Soundex, but this may be too simplistic for your requirements.

+3
source

Using Soundex turned out to be nice for me: With a little tweaking or two implementations, Soundex can check cross languages if two strings of different languages ​​sound the same ..

Objective-C Implementation of Soundex: http://www.cocoadev.com/index.pl?NSStringSoundex

0
source

I found an implementation of the Objective-C Levenshtein distance algorithm here . This works great for me.

0
source

Source: https://habr.com/ru/post/1302074/


All Articles