Find all subsets from the list of sets that appear in at least N different sets

Question

Find all subsets from the list of sets that appear in at least N different sets

I am working on an algorithm for grouping keywords from a search engine by the number of identical URLs that they have in SERP. Each group represents a URL, and each value is an identifier for the SERP keyword where the URL appeared.

I have a list of groups:

groups = [ [1], [1, 2 ,3], [1, 2, 3, 4, 5], [1, 2, 3 ,4], [2, 3], [4, 5, 6], [4, 5, 7] ]

I need to get ALL sets of elements that appear in at least N groups in decreasing order of size:

In the above example, for N = 3, we have two subsets: [1, 2, 3] and [4, 5]

As I see how to extract it:

Iteration 1: find the largest set that appears at least 3 times (this is [1, 2, 3]) and delete from the whole set where it appears.

After the iteration, we have:

 groups = [ [1], [4, 5], [4], [2, 3], [4, 5, 6], [4, 5, 7] ]

Iteration 2: find the largest one that appears at least 3 times (this is [4, 5])

After the iteration, we have:

 groups = [ [1], [4], [2, 3], [6], [7] ]

The end of the algorithm : because there are no more groups that appear at least 3 times in groups.

Do you have an idea for the algorithm to retrieve them?

N is between 1 and 10.

ps the list of groups is quite large, from 1000 to 10000 items. Numbers are identifiers of objects in db.

+3

python algorithm

Alex t May 12, '16 at 10:20

source share

2 answers

The first prototype approach / hack, combining the beauty of recursion, pseudo-functional programming and *** - for my part. There are many improvements, especially regarding iterators / lists. Perhaps it even qualifies as a spaghetti code :-).

Warning: see comment by @John Coleman regarding the binomial coefficient. We generate all possible subsets of the remaining values at each iteration. This can be improved if the generators are used lazily (but will still be out of reach for huge sets of unique numbers).

 import itertools groups = [ [1], [1, 2 ,3], [1, 2, 3, 4, 5], [1, 2, 3 ,4], [2, 3], [4, 5, 6], [4, 5, 7] ] def solve(groups, N, sol=[]): if len(groups) == 0: return sol rem_vals = list(set(itertools.chain(*groups))) combs = list(itertools.product(range(2), repeat=len(rem_vals))) combs_ = [[rem_vals[ind] for ind, i in enumerate(combs[comb]) if i] for comb in range(len(combs))] for cand in reversed(sorted(combs_, key=lambda x: len(list(itertools.chain(x))))): if len(cand) == 0: continue else: counter = 0 inds = [] for ind, g in enumerate(groups): if set(cand).issubset(g): counter += 1 inds.append(ind) if counter >= N: sol.append(cand) for i in inds: for j in cand: groups[i].remove(j) return solve(groups, N, sol) return sol print(solve(groups, 3))

Output

 [[1, 2, 3], [4, 5]]

+1

sascha May 12, '16 at 11:40

source share

John coleman · Accepted Answer · 2016-05-12T12:18:20+0000

Below is the itertools approach for the first part of what you call Iteration 1 . If it is possible to run it at all, you can run it several times in a loop, deleting the elements found at each stage. As I pointed out in the comments, this is only possible with small n :

 import itertools def intersect(sets): #forms the intersection of all s in sets #assumes that there is at least one I = sets[0].copy() for s in sets[1:]: I = I & s return I def largestIntersection(sets,n): maxSize = 0 maxSets = [set()] for groupCombo in itertools.combinations(sets,n): I = intersect(groupCombo) if len(I) > maxSize: maxSize = len(I) maxSets = [I] elif len(I) == maxSize: maxSets.append(I) return maxSets

For instance:

 >>> groups = [ [1], [1, 2 ,3], [1, 2, 3, 4, 5], [1, 2, 3 ,4], [2, 3], [4, 5, 6], [4, 5, 7] ] >>> groups = [set(group) for group in groups] >>> largestIntersection(groups,3) [{1, 2, 3}]

In the editor:. The following modification may help - depending on the distribution of numbers in the groups and the size of the groups:

 def guessSize(sets,n,trials): return max(len(intersect(random.sample(sets,n))) for trial in range(trials)) def maxIntersection(sets,n,trials = 100000): k = guessSize(sets,n,trials) filteredSets = [s for s in sets if len(s) >= k] return largestIntersection(filteredSets,n)

The idea is to first reduce the number of groups before trying to iterate over intersections. For instance:

 #stress test: import random nums = list(range(1,101)) testGroups = [set(random.sample(nums,random.randint(1,100))) for n in range(1000)] foundGroups = maxIntersection(testGroups,3)

it only takes a few seconds to calculate foundGroups , not a few minutes if I used the largestIntersection(testGroups) directly. On the other hand, with various variants of random parameters, the time savings become negligible.

Further editing: Perhaps you can be more aggressive with filtering:

 import collections def maxIntersection(sets,n,trials = 100000): k = guessSize(sets,n,trials) filteredSets = [s for s in sets if len(s) >= k] counts = collections.Counter() for s in filteredSets: for x in s: counts[x] +=1 t = {x for x,c in counts.items() if c >= k} filteredSets = [s & t for s in filteredSets if len(s & t) >= k] return largestIntersection(filteredSets,n)

Find all subsets from the list of sets that appear in at least N different sets

More articles: