Algorithm for a discrete similarity metric

Given two lists, each containing a subset of a common superset, is there an algorithm that gives me a similarity measure?

Example:

A = {John, Mary, Kate, Peter} and B = {Peter, James, Mary, Kate}

How similar are these two lists? Note that I do not know all the elements of a common superset.

Update: I was unclear, and I probably used the word "set" carelessly. My apologies. Clarification: the order matters. If identical elements occupy the same position in both lists, we have the highest similarity for that element. The similarity decreases the further apart the identical elements are. The similarity is lower still if an element exists in only one of the lists.

I could even add the extra dimension that lower indices are more important, so a[1] == b[1] counts for more than a[9] == b[9], but that's mostly because I'm curious.
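To make this concrete, here is a rough sketch of the kind of scoring I have in mind (the 1 / (1 + distance) falloff is arbitrary, and the extra index weighting is left out):

 def positional_similarity(a, b):
     score = 0.0
     for i, item in enumerate(a):
         if i < len(b) and b[i] == item:
             score += 1.0  # same element in the same position: full credit
         elif item in b:
             # same element in a different position: credit decays with distance
             score += 1.0 / (1 + abs(i - b.index(item)))
         # an element missing from b contributes nothing
     return score / max(len(a), len(b), 1)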

+4
5 answers

I would consider two strategies (both sketched below):

  • Treat the lists as sets and apply set operations (intersection, difference)
  • Treat the lists as strings of symbols and apply the Levenshtein algorithm
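A minimal Python sketch of both ideas; the Levenshtein function is the standard dynamic-programming version, written out here since it is not in the standard library:

 def set_similarity(a, b):
     # Treat the lists as sets: shared elements over total distinct elements.
     sa, sb = set(a), set(b)
     return len(sa & sb) / len(sa | sb)

 def levenshtein(a, b):
     # Classic dynamic-programming edit distance; works on lists as well as strings.
     prev = list(range(len(b) + 1))
     for i, x in enumerate(a, 1):
         curr = [i]
         for j, y in enumerate(b, 1):
             curr.append(min(prev[j] + 1,              # deletion
                             curr[j - 1] + 1,          # insertion
                             prev[j - 1] + (x != y)))  # substitution
         prev = curr
     return prev[-1]

 A = ["John", "Mary", "Kate", "Peter"]
 B = ["Peter", "James", "Mary", "Kate"]
 print(set_similarity(A, B))  # 3 shared out of 5 distinct = 0.6
 print(levenshtein(A, B))     # 3 edits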
+2

The Jaccard index (also known as the Tanimoto coefficient) is designed for exactly the use case described in the OP's question.

The Tanimoto coefficient, tau, is Nc divided by Na + Nb - Nc, or

 tau = Nc / (Na + Nb - Nc)

where:

  • Na is the number of elements in the first set
  • Nb is the number of elements in the second set
  • Nc is the size of the intersection of the two sets, i.e. the number of unique elements common to both

Here is the Tanimoto coefficient coded as a Python function:

 def tanimoto(x, y):
     # Nc: the elements common to both lists (assumes no duplicates within a list)
     w = [ns for ns in x if ns in y]
     # tau = Nc / (Na + Nb - Nc)
     return len(w) / (len(x) + len(y) - len(w))
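Applied to the lists from the question (each name occurs only once):

 A = ["John", "Mary", "Kate", "Peter"]
 B = ["Peter", "James", "Mary", "Kate"]
 print(tanimoto(A, B))  # Nc = 3, so 3 / (4 + 4 - 3) = 0.6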
+13

If you really do have sets (i.e. an element is simply present or absent, with no count attached), and only two of them, then just counting the number of shared elements and dividing by the total number of elements is probably about as good as it gets.

If you have (or can get) counts, and/or more than two sets, you can do a bit better than that with cosine similarity or TF-IDF (term frequency * inverse document frequency).

The latter tries to give lower weight to words that appear in all (or nearly all) of the "documents", i.e. sets of words.
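As an illustration, here is a minimal sketch of plain cosine similarity over element counts in Python (full TF-IDF weighting would need the whole collection of sets, so only the cosine part is shown):

 from collections import Counter
 from math import sqrt

 def cosine_similarity(a, b):
     # Treat each list as a bag of elements and compare the count vectors.
     ca, cb = Counter(a), Counter(b)
     dot = sum(ca[k] * cb[k] for k in ca)  # only shared elements contribute
     norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
     return dot / norm if norm else 0.0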

+1

What is your definition of "similarity measurement"? If all you need is how many elements the two sets share, you can take the cardinalities of A and B, add them together, and subtract the cardinality of the union of A and B; the result is the size of the intersection.
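With the sets from the question, that inclusion-exclusion count works out to:

 A = {"John", "Mary", "Kate", "Peter"}
 B = {"Peter", "James", "Mary", "Kate"}
 shared = len(A) + len(B) - len(A | B)  # 4 + 4 - 5 = 3 (Mary, Kate, Peter)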

0

If ordering matters, you can use the Levenshtein distance or another edit distance.
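If Python is available, the standard library's difflib already gives an order-sensitive similarity ratio on lists (it uses Ratcliff/Obershelp matching rather than Levenshtein proper):

 from difflib import SequenceMatcher

 A = ["John", "Mary", "Kate", "Peter"]
 B = ["Peter", "James", "Mary", "Kate"]
 print(SequenceMatcher(None, A, B).ratio())  # 2 * matches / total = 2 * 2 / 8 = 0.5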

0

Source: https://habr.com/ru/post/1341206/

