Difflib SequenceMatcher - Individual Equality

Question

I am trying to create a nested or recursive effect using SequenceMatcher.

The ultimate goal is to compare two sequences, both of which can contain instances of different types.

For example, sequences may be:

    l1 = [1, "Foo", "Bar", 3]
    l2 = [1, "Fo", "Bak", 2]

Normally, SequenceMatcher will identify only [1] as a common subsequence of l1 and l2.

I would like SequenceMatcher to be applied a second time to the string elements, so that "Foo" and "Fo" are considered equal, as are "Bar" and "Bak", and the longest common subsequence has length 3: [1, Foo/Fo, Bar/Bak]. That is, I would like SequenceMatcher to be more forgiving when comparing string members.

I tried to write a wrapper for the str built-in class:

    from difflib import SequenceMatcher

    class myString:
        def __init__(self, string):
            self.string = string

        def __hash__(self):
            return hash(self.string)

        def __eq__(self, other):
            # compare against the other wrapper's string, not our own
            if not isinstance(other, myString):
                return NotImplemented
            return SequenceMatcher(a=self.string, b=other.string).ratio() > 0.5

Edit: a possibly more elegant way:

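    class myString(str):
        # Python 3 drops the inherited __hash__ once __eq__ is defined, so restore it
        __hash__ = str.__hash__

        def __eq__(self, other):
            return SequenceMatcher(a=self, b=other).ratio() > 0.5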

Having done this, you can do the following:

    >>> Foo = myString("Foo")
    >>> Fo = myString("Fo")
    >>> Bar = myString("Bar")
    >>> Bak = myString("Bak")
    >>> l1 = [1, Foo, Bar, 3]
    >>> l2 = [1, Fo, Bak, 2]
    >>> SequenceMatcher(a=l1, b=l2).ratio()
    0.75

So this evidently works, but I have a bad feeling about overriding the hash function. When is the hash used? Where can it come back and bite me?

The SequenceMatcher documentation says the following:

This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.

And by definition, hashable elements must fulfill the following requirement:

Hashable objects which compare equal must have the same hash value.
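To make that requirement concrete, here is a minimal sketch of how a fuzzy `__eq__` paired with a hash of the exact text breaks it. The `Fuzzy` wrapper is a made-up name for illustration, and the 0.5 threshold is simply the one used above.

```python
from difflib import SequenceMatcher


class Fuzzy:
    """Hypothetical wrapper: fuzzy __eq__, but a hash of the exact text."""

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        return hash(self.text)

    def __eq__(self, other):
        if not isinstance(other, Fuzzy):
            return NotImplemented
        return SequenceMatcher(a=self.text, b=other.text).ratio() > 0.5


foo, fo = Fuzzy("Foo"), Fuzzy("Fo")
print(foo == fo)              # True: the fuzzy comparison accepts the pair
print(hash(foo) == hash(fo))  # False (barring a freak collision): the rule is broken
print(fo in {foo})            # False: the set compares hashes before calling __eq__
```

Any hash-based container (set, dict) will therefore treat the two objects as different keys even though `==` says they are equal.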

Also, do I need to override __cmp__?

I would like to hear about other solutions that come to mind.

Thanks.

python difflib

Yaronk Sep 7 '13 at 10:03

1 answer

ap · Answer 1 · 2015-05-18T21:27:08+0000

Your solution isn't bad. You could also look at reworking SequenceMatcher so that it applies itself recursively when the elements of a sequence are themselves iterable, with some custom logic. That would be a bit of a pain. If you only need this subset of SequenceMatcher's functionality, writing your own comparison tool might not be a bad idea either.

Overriding __hash__ so that "Foo" and "Fo" hash alike will cause collisions in dictionaries (hash tables) and the like. If you are literally only interested in the first 2 characters and are set on using SequenceMatcher, hashing just that prefix (something like returning hash(self[:2]) from __hash__ in the str subclass) may be the way to go.
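As far as I can tell from the CPython implementation, SequenceMatcher finds candidate matches by looking elements of the first sequence up in a dict keyed by the elements of the second, and only uses a bare == when extending a block outward from its starting point. That would explain why the example in the question gets 0.75: the fuzzy pairs sit right next to the exactly matching 1. The sketch below shifts the pairs out of alignment to show the difference the hash makes. The class names `Fuzzy`, `ExactHash` and `PrefixHash` are hypothetical, chosen only for this illustration.

```python
from difflib import SequenceMatcher


class Fuzzy:
    """Hypothetical wrapper with fuzzy equality; subclasses choose the hash."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        if not isinstance(other, Fuzzy):
            return NotImplemented
        return SequenceMatcher(a=self.text, b=other.text).ratio() > 0.5


class ExactHash(Fuzzy):
    def __hash__(self):
        return hash(self.text)      # "Foo" and "Fo" get different hashes


class PrefixHash(Fuzzy):
    def __hash__(self):
        return hash(self.text[:2])  # "Foo" and "Fo" share a hash


def ratio(cls):
    a = [cls("Foo"), cls("Bar")]
    b = [9, cls("Fo"), cls("Bak")]  # the leading 9 shifts the fuzzy pairs
    return SequenceMatcher(a=a, b=b).ratio()


print(ratio(ExactHash))   # 0.0: the dict lookup never reaches __eq__
print(ratio(PrefixHash))  # 0.8: equal hashes let the fuzzy __eq__ decide
```

The prefix hash only helps for strings that agree on their first two characters; a pair that `__eq__` accepts but that differs earlier on would still be missed.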

All that said, your best bet is a one-off tool. I can outline the basics of something like that if you're interested. You would just need to know what the constraints are in your situation (does the subsequence always start at the first element, that kind of thing).
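A rough outline of such a one-off tool, assuming a plain longest-common-subsequence table is enough and that only strings need the forgiving comparison. The helper names `similar` and `fuzzy_lcs` and the `threshold` parameter are made up for this sketch. No hashing is involved, so nothing has to be overridden.

```python
from difflib import SequenceMatcher


def similar(x, y, threshold=0.5):
    """Fuzzy match for two strings, plain equality for everything else."""
    if isinstance(x, str) and isinstance(y, str):
        return SequenceMatcher(a=x, b=y).ratio() > threshold
    return x == y


def fuzzy_lcs(a, b, eq=similar):
    """Longest common subsequence of a and b under eq, as (x, y) pairs."""
    # table[i][j] holds the LCS length of a[i:] and b[j:]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if eq(a[i], b[j]):
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # walk the table from the top-left corner to recover the matched pairs
    pairs, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if eq(a[i], b[j]):
            pairs.append((a[i], b[j]))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]
print(fuzzy_lcs(l1, l2))  # [(1, 1), ('Foo', 'Fo'), ('Bar', 'Bak')]
```

This calls `eq` for every pair of elements, so it costs on the order of len(a) * len(b) comparisons; that is fine for short lists but slower than SequenceMatcher's dict lookups on long ones.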
