Algorithms for calculating the distance between two strings

Is there any string distance algorithm that doesn't take word order into account?

The following algorithms do not produce the desired results (in this example, the desired result should be 1):

import jaro
jaro.jaro_winkler_metric(u'Michael Jordan',u'Jordan Michael')
>>>0.47

import Levenshtein
Levenshtein.ratio('Michael Jordan', 'Jordan Michael')
>>>0.5

from difflib import SequenceMatcher
SequenceMatcher(None, 'Michael Jordan', 'Jordan Michael').ratio()
>>>0.5

One way to do this is to sort the string's characters alphabetically and then feed the result to the above algorithms:

''.join(sorted('Michael Jordan'))
>>>' JMaacdehilnor'

''.join(sorted('Jordan Michael'))
>>>' JMaacdehilnor'

But here the information about where the first name ends and the surname begins is lost, and the results will not be "stable".

I created a function using permutations from itertools, which generates all possible orderings of the words, compares the strings for each ordering, and returns the maximum value. The results are satisfactory, but the whole procedure is very slow when I have to compare millions of names.
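A minimal sketch of that permutation idea, using the standard-library SequenceMatcher (the function name is my own; the original function was not shown):

```python
from itertools import permutations
from difflib import SequenceMatcher

def best_permutation_ratio(a, b):
    # Try every ordering of the words in `a` and keep the best match
    # against `b`. This is O(n!) in the number of words, which is why
    # it becomes very slow on millions of comparisons.
    return max(
        SequenceMatcher(None, ' '.join(p), b).ratio()
        for p in permutations(a.split())
    )

best_permutation_ratio('Michael Jordan', 'Jordan Michael')  # 1.0
```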

Something else that can be done is to sort the words, for example:

' '.join(sorted('Michael Jordan'.split()))
>>>'Jordan Michael'
' '.join(sorted('Jordan Michael'.split()))
>>>'Jordan Michael'

However, this stops working when the same name is transliterated differently, for example:

name1 = ' '.join(sorted('Bizen Dim'.split()))
>>>'Bizen Dim'
name2 = ' '.join(sorted('Dim Mpizen'.split()))
>>>'Dim Mpizen'

SequenceMatcher(None, name1, name2).ratio()
>>>0.55

Here the problem is that one string is written with "b" and the other with "mp" (a common Greek transliteration of the same sound). Because of this the words sort into different orders, and the result is again not "stable".

Is there a string distance algorithm that ignores word order? Or a faster way to do what I described with permutations?


Use fuzzywuzzy.

Install:

pip install fuzzywuzzy
pip install python-Levenshtein

Then:

from fuzzywuzzy import fuzz
fuzz.token_sort_ratio(u'Michael Jordan', u'Jordan Michael')
>>>100

As you noticed, character order is what these metrics look at. In Python you can lowercase the string, sort its characters, and then compare the results with whatever metric you like (levenshtein, etc.):

>>> ''.join(sorted('Michael Jordan'.lower()))
' aacdehijlmnor'

You can also locate words with .index() (or, for more complex patterns, with the re module).
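A sketch of this answer's suggestion, wrapped in a helper (the function name is my own invention):

```python
from difflib import SequenceMatcher

def sorted_char_ratio(a, b):
    # Lowercase and sort the characters of both strings, then compare;
    # word order (and word boundaries) no longer matter.
    sa = ''.join(sorted(a.lower()))
    sb = ''.join(sorted(b.lower()))
    return SequenceMatcher(None, sa, sb).ratio()

sorted_char_ratio('Michael Jordan', 'Jordan Michael')  # 1.0
```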


You can tokenize the two strings (for example, with the NLTK tokenizer), compute the distance between each pair of words, and return the sum of all the distances.
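A rough sketch of that idea, using plain str.split() instead of NLTK and a greedy word pairing; the function name and the averaging (rather than summing) are my assumptions:

```python
from difflib import SequenceMatcher

def avg_word_similarity(a, b):
    # Greedily pair each word of `a` with its most similar unused word
    # of `b`, then average the per-pair ratios. Assumes both strings
    # split into the same number of words.
    remaining = b.split()
    total = 0.0
    for w in a.split():
        best = max(remaining, key=lambda x: SequenceMatcher(None, w, x).ratio())
        total += SequenceMatcher(None, w, best).ratio()
        remaining.remove(best)
    return total / len(a.split())

avg_word_similarity('Michael Jordan', 'Jordan Michael')  # 1.0
```

On the 'Bizen Dim' / 'Dim Mpizen' pair this pairs 'Bizen' with 'Mpizen' and 'Dim' with 'Dim', so it should score noticeably higher than sorting and comparing the whole strings.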


Source: https://habr.com/ru/post/1628733/

