Algorithm for a discrete similarity metric

Given two lists, each containing a subset of a common superset, is there an algorithm that gives me a similarity measure?

Example:

A = {John, Mary, Kate, Peter} and B = {Peter, James, Mary, Kate}

How similar are these two lists? Note that I do not know all the elements of a common superset.

Update: I was unclear, and I probably used the word "set" carelessly. My apologies. Clarification: the order matters. If identical elements occupy the same position in both lists, we have the highest similarity for that element. The similarity decreases the further apart the identical elements are. The similarity is lower still if an element exists in only one of the lists.

I could even add the extra dimension that lower indices are more important, so a[1] == b[1] counts for more than a[9] == b[9], but that's mostly because I'm curious.
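To make this concrete, here is a rough sketch of the kind of scoring I have in mind (the 1 / (1 + distance) falloff is arbitrary, and the extra index weighting is left out):

 def positional_similarity(a, b):
     score = 0.0
     for i, item in enumerate(a):
         if i < len(b) and b[i] == item:
             score += 1.0  # same element in the same position: full credit
         elif item in b:
             # same element in a different position: credit decays with distance
             score += 1.0 / (1 + abs(i - b.index(item)))
         # an element missing from b contributes nothing
     return score / max(len(a), len(b), 1)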

+4
5 answers

I would consider two strategies (both sketched below):

  • Treat the lists as sets and apply set operations (intersection, difference)
  • Treat the lists as strings of symbols and apply the Levenshtein algorithm
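A minimal Python sketch of both ideas; the Levenshtein function is the standard dynamic-programming version, written out here since it is not in the standard library:

 def set_similarity(a, b):
     # Treat the lists as sets: shared elements over total distinct elements.
     sa, sb = set(a), set(b)
     return len(sa & sb) / len(sa | sb)

 def levenshtein(a, b):
     # Classic dynamic-programming edit distance; works on lists as well as strings.
     prev = list(range(len(b) + 1))
     for i, x in enumerate(a, 1):
         curr = [i]
         for j, y in enumerate(b, 1):
             curr.append(min(prev[j] + 1,              # deletion
                             curr[j - 1] + 1,          # insertion
                             prev[j - 1] + (x != y)))  # substitution
         prev = curr
     return prev[-1]

 A = ["John", "Mary", "Kate", "Peter"]
 B = ["Peter", "James", "Mary", "Kate"]
 print(set_similarity(A, B))  # 3 shared out of 5 distinct = 0.6
 print(levenshtein(A, B))     # 3 edits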
+2

The Jaccard index (also known as the Tanimoto coefficient) is designed for exactly the use case described in the OP's question.

The Tanimoto coefficient, tau, is Nc divided by Na + Nb - Nc, or

 tau = Nc / (Na + Nb - Nc)

where:

  • Na is the number of elements in the first set
  • Nb is the number of elements in the second set
  • Nc is the size of the intersection of the two sets, i.e. the number of unique elements common to both

Here is the Tanimoto coefficient coded as a Python function:

 def tanimoto(x, y):
     # Nc: the elements common to both lists (assumes no duplicates within a list)
     w = [ns for ns in x if ns in y]
     # tau = Nc / (Na + Nb - Nc)
     return len(w) / (len(x) + len(y) - len(w))
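Applied to the lists from the question (each name occurs only once):

 A = ["John", "Mary", "Kate", "Peter"]
 B = ["Peter", "James", "Mary", "Kate"]
 print(tanimoto(A, B))  # Nc = 3, so 3 / (4 + 4 - 3) = 0.6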
+13

If you really do have sets (i.e. an element is simply present or absent, with no count attached), and only two of them, then just counting the number of shared elements and dividing by the total number of elements is probably about as good as it gets.

If you have (or can get) counts, and/or more than two sets, you can do a bit better than that with cosine similarity or TF-IDF (term frequency * inverse document frequency).

The latter tries to give lower weight to words that appear in all (or nearly all) of the "documents", i.e. sets of words.
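As an illustration, here is a minimal sketch of plain cosine similarity over element counts in Python (full TF-IDF weighting would need the whole collection of sets, so only the cosine part is shown):

 from collections import Counter
 from math import sqrt

 def cosine_similarity(a, b):
     # Treat each list as a bag of elements and compare the count vectors.
     ca, cb = Counter(a), Counter(b)
     dot = sum(ca[k] * cb[k] for k in ca)  # only shared elements contribute
     norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
     return dot / norm if norm else 0.0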

+1

What is your definition of "similarity measurement"? If all you need is how many elements the two sets share, you can take the cardinalities of A and B, add them together, and subtract the cardinality of the union of A and B; the result is the size of the intersection.
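With the sets from the question, that inclusion-exclusion count works out to:

 A = {"John", "Mary", "Kate", "Peter"}
 B = {"Peter", "James", "Mary", "Kate"}
 shared = len(A) + len(B) - len(A | B)  # 4 + 4 - 5 = 3 (Mary, Kate, Peter)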

0

If ordering matters, you can use the Levenshtein distance or another edit distance.
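If Python is available, the standard library's difflib already gives an order-sensitive similarity ratio on lists (it uses Ratcliff/Obershelp matching rather than Levenshtein proper):

 from difflib import SequenceMatcher

 A = ["John", "Mary", "Kate", "Peter"]
 B = ["Peter", "James", "Mary", "Kate"]
 print(SequenceMatcher(None, A, B).ratio())  # 2 * matches / total = 2 * 2 / 8 = 0.5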

0

Source: https://habr.com/ru/post/1341206/

