How to create a threshold for matching strings using Levenshtein distance, accounting for typos?

Recently we ran into an interesting problem: duplicate user data was found in our database. We noticed that for most of these duplicates, the Levenshtein distance between the two strings was exactly the difference in their lengths. In other words, one string can be obtained from the other just by inserting characters (the shorter string is a subsequence of the longer one), and in most cases this seems like a reasonable criterion for treating two entries as duplicates.

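For illustration, that criterion reduces to a subsequence test, since a pure-insertion edit script never reorders characters. A minimal sketch (the function name is hypothetical, not from our codebase):

```python
def is_pure_insertion_pair(a, b):
    """True iff Levenshtein(a, b) == abs(len(a) - len(b)),
    i.e. one string is the other plus some inserted characters."""
    shorter, longer = sorted((a, b), key=len)
    it = iter(longer)
    return all(ch in it for ch in shorter)   # subsequence check

# is_pure_insertion_pair("Jon Smith", "John Smith")  -> True
```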
We also want to account for typos. So we started wondering how often people make typos per word when typing online, hoping to factor that statistic into the distance threshold, but we could not find any such data.

Is there any way to take typos into account when choosing such a threshold for matching data?

Let me know if I can clarify!

+3
2 answers

First, the Levenshtein distance is defined as the minimum number of edits needed to convert string A into string B, where an edit is the insertion or deletion of a single character, or the replacement of one character with another. So it is very much "the difference between the two strings", for a particular definition of distance. =)

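For concreteness, here is a minimal textbook implementation of that definition (a sketch, not from the original post):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

# levenshtein("kitten", "sitting") -> 3
```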
Second, Levenshtein distance is a special case of a more general weighted edit distance F(A, B), in which each kind of edit is assigned its own cost; the classic algorithm for computing it is Needleman-Wunsch. This is exactly where typos fit in: for instance, you can make substituting a character with one of its neighbours on a QWERTY keyboard cheaper than other substitutions, since such slips of the finger are by far the most common typos. Your threshold then applies to a weighted score rather than a raw edit count.

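A sketch of that keyboard-aware idea follows. The 0.5 cost for an adjacent-key substitution is an illustrative assumption, not a value from the answer, and the row alignment is only approximate (real QWERTY rows are staggered):

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def build_adjacency():
    adj = {}
    for r, row in enumerate(QWERTY_ROWS):
        for c, ch in enumerate(row):
            near = adj.setdefault(ch, set())
            if c > 0:
                near.add(row[c - 1])              # key to the left
            if c + 1 < len(row):
                near.add(row[c + 1])              # key to the right
            for rr in (r - 1, r + 1):             # keys above and below
                if 0 <= rr < len(QWERTY_ROWS) and c < len(QWERTY_ROWS[rr]):
                    near.add(QWERTY_ROWS[rr][c])
    return adj

ADJACENT = build_adjacency()

def sub_cost(x, y):
    """Substitution cost: free if equal, cheap if the keys are neighbours."""
    if x == y:
        return 0.0
    if y.lower() in ADJACENT.get(x.lower(), set()):
        return 0.5
    return 1.0

def weighted_distance(a, b, indel=1.0):
    """Needleman-Wunsch-style DP with per-edit costs."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * indel]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + indel,                 # delete ca
                curr[j - 1] + indel,             # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # substitute
            ))
        prev = curr
    return prev[-1]

# weighted_distance("cat", "cst") -> 0.5, since 'a' and 's' are neighbours
```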
Third, if all you need to know is whether the distance is below some threshold k, you do not have to fill in the full O(len(A) * len(B)) dynamic-programming table: you can abandon a comparison as soon as the distance is guaranteed to exceed k, and there are algorithms that decide the question in roughly O(k^2) time. Beyond that, two properties of the distance worth knowing:

  • If you give insertions and deletions different costs, the measure stops being symmetric: F(A, B) will no longer equal F(B, A).
  • abs(len(A) - len(B)) is a lower bound on the distance between A and B, so it makes a cheap pre-filter: pairs whose lengths differ by more than your threshold cannot possibly match (a sketch combining this with an early exit follows this list).

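A minimal sketch combining both observations, under unit costs (the helper name is ours): a cheap length filter first, then a row-by-row DP that bails out as soon as the distance is guaranteed to exceed k.

```python
def within_threshold(a, b, k):
    if abs(len(a) - len(b)) > k:       # lower-bound filter: no DP needed
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        if min(curr) > k:              # the row minimum never decreases,
            return False               # so the final distance must exceed k
        prev = curr
    return prev[-1] <= k
```

The early exit is sound because every edit path to the final cell passes through each row, and all edit costs are non-negative, so the final distance can never drop below the current row's minimum.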
, , , "1st St." " ", , , , . - , , , . , , , "1-" "" .. , , .

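A minimal normalization sketch; the abbreviation table is purely illustrative and would need to match your own data:

```python
import re

ABBREVIATIONS = {"st": "street", "ave": "avenue", "1st": "first"}

def normalize(s):
    s = re.sub(r"[^\w\s]", " ", s.lower())   # lowercase, strip punctuation
    return " ".join(ABBREVIATIONS.get(w, w) for w in s.split())

# normalize("1st St.") and normalize("First Street") both yield
# "first street", so their edit distance drops to zero.
```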
One practical note: if you are working in PHP, the built-in levenshtein() function only handles strings up to 255 characters, so longer inputs need a different approach. If you can use Python, the standard-library difflib module offers similar fuzzy matching.
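
For example, with difflib (standard library; the 0.6 cutoff below is an arbitrary illustration, not a recommended value):

```python
import difflib

# SequenceMatcher gives a similarity ratio in [0, 1] rather than an
# edit count; treat pairs above a chosen cutoff as duplicate candidates.
print(difflib.SequenceMatcher(None, "1st Street", "1st St").ratio())  # 0.75

# Or pick likely duplicates from a list of candidates:
print(difflib.get_close_matches("1st Street",
                                ["First Street", "2nd Street", "1st St"],
                                n=3, cutoff=0.6))
```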

+7

Source: https://habr.com/ru/post/1756675/

