How to group words whose Levenshtein similarity is above 80% in Python

Suppose I have a list: -

person_name = ['zakesh', 'oldman LLC', 'bikash', 'goldman LLC', 'zikash','rakesh']

I am trying to group the list so that the strings with the highest Levenshtein similarity end up next to each other. To measure how similar two strings are, I use the Python fuzzywuzzy package.

Examples: -

>>> from fuzzywuzzy import fuzz
>>> combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
>>> fuzz.ratio('goldman LLC', 'oldman LLC')
95
>>> fuzz.ratio('rakesh', 'zakesh')
83
>>> fuzz.ratio('bikash', 'zikash')
83
>>> 
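
For reference, the three scores above can be reproduced with the standard library alone. This is only a minimal sketch, not fuzzywuzzy's own code: `similarity` is a hypothetical helper, and using `difflib.SequenceMatcher` as the scorer is an assumption (fuzzywuzzy may use a different backend, so scores for other inputs could differ).

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity of a and b as an int from 0 to 100."""
    # SequenceMatcher.ratio() is 2*M/T, where M is the number of matching
    # characters and T is the combined length of both strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(similarity('goldman LLC', 'oldman LLC'))
print(similarity('rakesh', 'zakesh'))
print(similarity('bikash', 'zikash'))
```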

My ultimate goal:

I want to group the words so that the similarity ratio between words in the same group is more than 80.

My list should look something like this: -

person_name = ['bikash', 'zikash', 'rakesh', 'zakesh', 'goldman LLC', 'oldman LLC']

The similarity between `bikash` and `zikash` is very high, so they should be next to each other.

Code:

I am trying to achieve this by sorting, with fuzz.ratio as the key function. The code below does not work (a sort key function receives a single element, while fuzz.ratio compares two), but it shows the angle from which I am approaching the problem.

from fuzzywuzzy import fuzz
combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
combined_list.sort(key=lambda x, y: fuzz.ratio(x, y))
print combined_list
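
A sort key is called with one element at a time, whereas similarity is a property of a pair, so one possible first step is to score every pair directly. The sketch below does that; `similarity` and `close_pairs` are hypothetical names, and the `difflib`-based scorer is an assumption standing in for `fuzz.ratio` so that the snippet runs without fuzzywuzzy installed.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Hypothetical stand-in for fuzz.ratio (0-100), based on difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def close_pairs(names, threshold=80):
    """Return every unordered pair of names whose score is above threshold."""
    return [(a, b) for a, b in combinations(names, 2)
            if similarity(a, b) > threshold]


combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
for pair in close_pairs(combined_list):
    print(pair)
```

For this list, only the three pairs already shown in the examples score above 80.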


Answer:

from fuzzywuzzy import fuzz

combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
combined_list.append('bakesh')
print('input names:', combined_list)

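grs = []  # groups of names whose pairwise ratio is > 80
for name in combined_list:
    for g in grs:
        # join the first group in which the name matches every member
        if all(fuzz.ratio(name, w) > 80 for w in g):
            g.append(name)
            break
    else:
        # no existing group fits (the inner loop did not break): start a new one
        grs.append([name])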

print('output groups:', grs)
outlist = [el for g in grs for el in g]
print('output list:', outlist)

input names: ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC', 'bakesh']
output groups: [['rakesh', 'zakesh', 'bakesh'], ['bikash', 'zikash'], ['goldman LLC', 'oldman LLC']]
output list: ['rakesh', 'zakesh', 'bakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
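
The same greedy idea can be wrapped in a reusable function with the scorer and threshold as parameters. This is a sketch under assumptions: `group_similar` and `similarity` are hypothetical names, and the `difflib`-based scorer only stands in for `fuzz.ratio` so the snippet has no third-party dependency; passing `fuzz.ratio` as `scorer` should work the same way.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Hypothetical stand-in for fuzz.ratio (0-100), based on difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def group_similar(names, scorer=similarity, threshold=80):
    """Greedily put each name into the first group where it scores above
    threshold against every existing member; otherwise open a new group."""
    groups = []
    for name in names:
        for group in groups:
            if all(scorer(name, member) > threshold for member in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


names = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC', 'bakesh']
groups = group_similar(names)
flat = [name for group in groups for name in group]  # grouped names as one list
print(groups)
print(flat)
```

Because each name joins the first group that accepts it, the result can depend on the order of the input list.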



Source: https://habr.com/ru/post/1627111/

