Distance between words

I am trying to find how many characters I need to remove in order to make two words the same. For example, "at", "cat" will be 1 because I could remove c, "boat" and "got" would be 3, because I could remove b, a and g to do this ot. I put words in a dictionary with their meaning as meaning. Then I iterate over the dictionary and see if this key exists in another dictionary, otherwise I will add 1 to the difference. Is this a very inefficient algorithm?

But this overestimates the number of deletions I need.

def deletiondistance(firstword, secondword): dfw = {} dsw = {} diff = 0 for i in range(len(firstword)): print firstword[i] if firstword[i] in dfw: dfw[firstword[i]]+=1 else: dfw[firstword[i]]=1 for j in range(len(secondword)): if secondword[j] in dsw: dsw[secondword[j]] +=1 else: dsw[secondword[j]]=1 for key, value in dfw.iteritems(): if key in dsw: #print "key exists" pass else: diff +=1 print "diff",diff 
+6
source share
3 answers

As @Hulk said, this is similar to Levenshtein distance. The only difference is that replacements are not allowed, but this can be corrected using replacement cost 2, which is the same as removing a character from both lines. Example:

 def dist(s1, s2): cur = list(range(len(s2) + 1)) prev = [0] * (len(s2) + 1) for i in range(len(s1)): cur, prev = prev, cur cur[0] = i + 1 for j in range(len(s2)): # Substitution is same as two deletions sub = 0 if s1[i] == s2[j] else 2 cur[j+1] = min(prev[j] + sub, cur[j] + 1, prev[j+1] + 1) return cur[-1] cases=[('cat','bat'), ('bat','cat'), ('broom', 'ballroom'), ('boat','got'), ('foo', 'bar'), ('foobar', '')] for s1, s2 in cases: print('{} & {} = {}'.format(s1, s2, dist(s1, s2))) 

Output:

 cat & bat = 2 bat & cat = 2 broom & ballroom = 3 boat & got = 3 foo & bar = 6 foobar & = 6 
+3
source

I think your goal is similar to levenshtein distance.

Levenshtein distance is a metric for measuring the distance between two strings.

Here is the wiki link. https://en.wikipedia.org/wiki/Levenshtein_distance

And here is the pypi package for levenshtein distance. https://pypi.python.org/pypi/python-Levenshtein

+4
source

You can use difflib for this.

Example:

 import difflib cases=[('cat','bat'), ('bat','cat'), ('broom', 'ballroom'), ('boat','got')] for a,b in cases: print('{} => {}'.format(a,b)) cnt=0 for i,s in enumerate(difflib.ndiff(a, b)): if s[0]==' ': continue elif s[0]=='-': print(u'Delete "{}" from position {}'.format(s[-1],i)) elif s[0]=='+': print(u'Add "{}" to position {}'.format(s[-1],i)) cnt+=1 print("total=",cnt,"\n") 

Print

 cat => bat Delete "c" from position 0 Add "b" to position 1 total= 2 bat => cat Delete "b" from position 0 Add "c" to position 1 total= 2 broom => ballroom Add "a" to position 1 Add "l" to position 2 Add "l" to position 3 total= 3 boat => got Delete "b" from position 0 Add "g" to position 1 Delete "a" from position 3 total= 3 
+2
source

Source: https://habr.com/ru/post/1013407/


All Articles