Algorithms for calculating the distance between two strings

Is there any string distance algorithm that doesn't take word order into account?

The following algorithms do not produce the desired results (in this example, the desired result should be 1):

import jaro
jaro.jaro_winkler_metric(u'Michael Jordan',u'Jordan Michael')
>>>0.47

import Levenshtein
Levenshtein.ratio('Michael Jordan', 'Jordan Michael')
>>>0.5

from difflib import SequenceMatcher
SequenceMatcher(None, 'Michael Jordan', 'Jordan Michael').ratio()
>>>0.5

One way to do this is to sort the string's characters alphabetically and then feed the result to the above algorithms:

''.join(sorted('Michael Jordan'))
>>>' JMaacdehilnor'

''.join(sorted('Jordan Michael'))
>>>' JMaacdehilnor'

But here the information about where the first name ends and the surname begins is lost, and the results will not be "stable".

I created a function using permutations from itertools, which generates all possible orderings of the words, compares the strings for each ordering, and returns the maximum value. The results are satisfactory, but the whole procedure is very slow when I have to compare millions of names.
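A minimal sketch of that permutation idea, using the standard-library SequenceMatcher (the function name is my own; the original function was not shown):

```python
from itertools import permutations
from difflib import SequenceMatcher

def best_permutation_ratio(a, b):
    # Try every ordering of the words in `a` and keep the best match
    # against `b`. This is O(n!) in the number of words, which is why
    # it becomes very slow on millions of comparisons.
    return max(
        SequenceMatcher(None, ' '.join(p), b).ratio()
        for p in permutations(a.split())
    )

best_permutation_ratio('Michael Jordan', 'Jordan Michael')  # 1.0
```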

Something else that can be done is to sort the words, for example:

' '.join(sorted('Michael Jordan'.split()))
>>>'Jordan Michael'
' '.join(sorted('Jordan Michael'.split()))
>>>'Jordan Michael'

However, this stops working when the same name is transliterated differently, for example:

name1 = ' '.join(sorted('Bizen Dim'.split()))
>>>'Bizen Dim'
name2 = ' '.join(sorted('Dim Mpizen'.split()))
>>>'Dim Mpizen'

SequenceMatcher(None, name1, name2).ratio()
>>>0.55

Here the problem is that one string is written with "b" and the other with "mp" (a common Greek transliteration of the same sound). Because of this the words sort into different orders, and the result is again not "stable".

Is there a string distance algorithm that ignores word order? Or a faster way to do what I described with permutations?


Use fuzzywuzzy.

Install:

pip install fuzzywuzzy
pip install python-Levenshtein

Then:

from fuzzywuzzy import fuzz
fuzz.token_sort_ratio(u'Michael Jordan', u'Jordan Michael')
>>>100

As you noticed, character order is what these metrics look at. In Python you can lowercase the string, sort its characters, and then compare the results with whatever metric you like (levenshtein, etc.):

>>> ''.join(sorted('Michael Jordan'.lower()))
' aacdehijlmnor'

You can also locate words with .index() (or, for more complex patterns, with the re module).
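A sketch of this answer's suggestion, wrapped in a helper (the function name is my own invention):

```python
from difflib import SequenceMatcher

def sorted_char_ratio(a, b):
    # Lowercase and sort the characters of both strings, then compare;
    # word order (and word boundaries) no longer matter.
    sa = ''.join(sorted(a.lower()))
    sb = ''.join(sorted(b.lower()))
    return SequenceMatcher(None, sa, sb).ratio()

sorted_char_ratio('Michael Jordan', 'Jordan Michael')  # 1.0
```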


You can tokenize the two strings (for example, with the NLTK tokenizer), compute the distance between each pair of words, and return the sum of all the distances.
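A rough sketch of that idea, using plain str.split() instead of NLTK and a greedy word pairing; the function name and the averaging (rather than summing) are my assumptions:

```python
from difflib import SequenceMatcher

def avg_word_similarity(a, b):
    # Greedily pair each word of `a` with its most similar unused word
    # of `b`, then average the per-pair ratios. Assumes both strings
    # split into the same number of words.
    remaining = b.split()
    total = 0.0
    for w in a.split():
        best = max(remaining, key=lambda x: SequenceMatcher(None, w, x).ratio())
        total += SequenceMatcher(None, w, best).ratio()
        remaining.remove(best)
    return total / len(a.split())

avg_word_similarity('Michael Jordan', 'Jordan Michael')  # 1.0
```

On the 'Bizen Dim' / 'Dim Mpizen' pair this pairs 'Bizen' with 'Mpizen' and 'Dim' with 'Dim', so it should score noticeably higher than sorting and comparing the whole strings.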


Source: https://habr.com/ru/post/1628733/

