Percentage overlap of two lists

Question


This is more of a mathematical problem than anything else. Suppose I have two lists of different sizes in Python:

listA = ["Alice", "Bob", "Joe"]
listB = ["Joe", "Bob", "Alice", "Ken"]

I want to know by what percentage these two lists overlap. Order within the lists does not matter. Finding the matching elements is easy, and I have seen other posts on how to do that, but I cannot extend it to compute the percentage of overlap. If I compared the lists in the opposite order, would the result be different? What would be the best way to do this?


python list overlap similarity

OneManRiot Apr 28 '15 at 20:15


4 answers

JuniorCompressor · Answer 1 · 2015-04-28T20:21:14+0000

The maximum difference occurs when the two lists have no elements in common. In that case there are n + m distinct elements, where n is the size of the first list and m is the size of the second list, so n + m is an upper bound. One possible measure is:

 2 * c / (n + m)

where c is the number of common elements. This can be calculated as a percentage:

 200.0 * len(set(listA) & set(listB)) / (len(listA) + len(listB))
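As a minimal sketch, here is the same formula wrapped in a function and applied to the lists from the question. The name `dice_percentage` is made up for illustration, and deduplicating both lists before counting is an assumption (it keeps repeated names from inflating n + m). This measure is commonly known as the Sørensen-Dice coefficient.

```python
def dice_percentage(list_a, list_b):
    """Return 2 * c / (n + m) as a percentage, counting distinct elements only."""
    set_a, set_b = set(list_a), set(list_b)
    total = len(set_a) + len(set_b)
    if total == 0:
        # Two empty lists: report no overlap (an arbitrary convention)
        return 0.0
    return 200.0 * len(set_a & set_b) / total


listA = ["Alice", "Bob", "Joe"]
listB = ["Joe", "Bob", "Alice", "Ken"]
print(dice_percentage(listA, listB))  # 200 * 3 / 7, roughly 85.7
```

Swapping the arguments gives the same value, because both the intersection and the sum n + m are symmetric.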

Ofiris · Answer 2 · 2015-04-28T20:21:46+0000

 >>> len(set(listA) & set(listB)) / float(len(set(listA) | set(listB))) * 100
 75.0

I would compute the number of common elements as a fraction of the total number of distinct elements.

len(set(listA) & set(listB)) returns the number of common items (3 in your example).

len(set(listA) | set(listB)) returns the total number of distinct items (4).

Multiply by 100 and you will get a percentage.
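A small sketch of this ratio as a reusable function may help with the question of whether the comparison order matters. The name `jaccard_percentage` is made up for illustration (the ratio itself is usually called the Jaccard index), and returning 0.0 for two empty lists is an assumption.

```python
def jaccard_percentage(list_a, list_b):
    """Return |A & B| / |A | B| as a percentage, ignoring order and duplicates."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    if not union:
        # Both lists empty: avoid dividing by zero (convention chosen here)
        return 0.0
    return 100.0 * len(set_a & set_b) / len(union)


listA = ["Alice", "Bob", "Joe"]
listB = ["Joe", "Bob", "Alice", "Ken"]
print(jaccard_percentage(listA, listB))  # 75.0
print(jaccard_percentage(listB, listA))  # 75.0
```

Since intersection and union are both symmetric, the result is the same whichever list comes first, and the order of the elements inside each list has no effect.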

geckon · Answer 3 · 2015-04-28T20:45:20+0000

From a conceptual point of view, I would say there are a few reasonable questions you might be asking:

What percentage of the first list overlaps with the second? That is, how large is the common part relative to the first list?
The same question for the second list.
What percentage overlaps relative to the "universe" (i.e., the union of both lists)?

Of course, other definitions are possible, and there are many of them. In general, you should know what problem you are trying to solve.

From a programming point of view, the solution is easy:

 listA = ["Alice", "Bob", "Joe"]
 listB = ["Joe", "Bob", "Alice", "Ken"]

 setA = set(listA)
 setB = set(listB)

 overlap = setA & setB
 universe = setA | setB

 result1 = float(len(overlap)) / len(setA) * 100
 result2 = float(len(overlap)) / len(setB) * 100
 result3 = float(len(overlap)) / len(universe) * 100

inspectorG4dget · Answer 4 · 2015-04-28T20:51:47+0000

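 def computeOverlap(L1, L2):
     # Count how many times each element occurs in each list
     d1, d2 = {}, {}
     for e in L1:
         if e not in d1:
             d1[e] = 0
         d1[e] += 1
     for e in L2:
         if e not in d2:
             d2[e] = 0
         d2[e] += 1

     # Number of items that can be matched in the other list, duplicates included
     o1, o2 = 0, 0
     for k in d1:
         o1 += min(d1[k], d2.get(k, 0))
     for k in d2:
         o2 += min(d1.get(k, 0), d2[k])

     # Express each count relative to its own list; guard against empty lists
     p1 = 100.0 * o1 / len(L1) if L1 else 0
     p2 = 100.0 * o2 / len(L2) if L2 else 0
     print("%s%% of the first list overlaps with the second list" % p1)
     print("%s%% of the second list overlaps with the first list" % p2)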

Of course, you could use defaultdict and Counter to make things a little easier:

 from collections import defaultdict, Counter

 def computeOverlap(L1, L2):
     # Counter tallies occurrences; defaultdict(int) yields 0 for missing keys
     d1 = defaultdict(int, Counter(L1))
     d2 = defaultdict(int, Counter(L2))

     o1, o2 = 0, 0
     for k in d1:
         o1 += min(d1[k], d2[k])
     for k in d2:
         o2 += min(d1[k], d2[k])

     # Express each count relative to its own list; guard against empty lists
     p1 = 100.0 * o1 / len(L1) if L1 else 0
     p2 = 100.0 * o2 / len(L2) if L2 else 0
     print("%s%% of the first list overlaps with the second list" % p1)
     print("%s%% of the second list overlaps with the first list" % p2)
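For what it is worth, Counter also supports `&`, which keeps the minimum count of each key, so the counting loops can be collapsed. The sketch below assumes the goal is the share of each list's items, duplicates included, that can be matched in the other list; the name `multiset_overlap` is hypothetical, and it returns the two percentages instead of printing them.

```python
from collections import Counter


def multiset_overlap(list_a, list_b):
    """Return (percent of list_a matched in list_b, percent of list_b matched in list_a)."""
    # Counter & Counter keeps min(count_a, count_b) for every shared key
    common = sum((Counter(list_a) & Counter(list_b)).values())
    pct_a = 100.0 * common / len(list_a) if list_a else 0.0
    pct_b = 100.0 * common / len(list_b) if list_b else 0.0
    return pct_a, pct_b


print(multiset_overlap(["Alice", "Bob", "Joe"], ["Joe", "Bob", "Alice", "Ken"]))
# (100.0, 75.0)
```

Unlike the set-based answers, this version treats repeated items as separate entries, so ["a", "a"] against ["a"] gives 50% for the first list and 100% for the second.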
