Find all subsets from the list of sets that appear in at least N different sets

I am working on an algorithm for grouping keywords from a search engine by the number of identical URLs that they have in SERP. Each group represents a URL, and each value is an identifier for the SERP keyword where the URL appeared.


I have a list of groups:

groups = [ [1], [1, 2 ,3], [1, 2, 3, 4, 5], [1, 2, 3 ,4], [2, 3], [4, 5, 6], [4, 5, 7] ] 

I need to get ALL sets of elements that appear in at least N groups in decreasing order of size:

In the above example, for N = 3, we have two subsets: [1, 2, 3] and [4, 5]


As I see how to extract it:

Iteration 1: find the largest set that appears at least 3 times (this is [1, 2, 3]) and delete from the whole set where it appears.

After the iteration, we have:

 groups = [ [1], [4, 5], [4], [2, 3], [4, 5, 6], [4, 5, 7] ] 

Iteration 2: find the largest one that appears at least 3 times (this is [4, 5])

After the iteration, we have:

 groups = [ [1], [4], [2, 3], [6], [7] ] 

The end of the algorithm : because there are no more groups that appear at least 3 times in groups.


Do you have an idea for the algorithm to retrieve them?

N is between 1 and 10.

ps the list of groups is quite large, from 1000 to 10000 items. Numbers are identifiers of objects in db.

+3
source share
2 answers

Below is the itertools approach for the first part of what you call Iteration 1 . If it is possible to run it at all, you can run it several times in a loop, deleting the elements found at each stage. As I pointed out in the comments, this is only possible with small n :

 import itertools def intersect(sets): #forms the intersection of all s in sets #assumes that there is at least one I = sets[0].copy() for s in sets[1:]: I = I & s return I def largestIntersection(sets,n): maxSize = 0 maxSets = [set()] for groupCombo in itertools.combinations(sets,n): I = intersect(groupCombo) if len(I) > maxSize: maxSize = len(I) maxSets = [I] elif len(I) == maxSize: maxSets.append(I) return maxSets 

For instance:

 >>> groups = [ [1], [1, 2 ,3], [1, 2, 3, 4, 5], [1, 2, 3 ,4], [2, 3], [4, 5, 6], [4, 5, 7] ] >>> groups = [set(group) for group in groups] >>> largestIntersection(groups,3) [{1, 2, 3}] 

In the editor:. The following modification may help - depending on the distribution of numbers in the groups and the size of the groups:

 def guessSize(sets,n,trials): return max(len(intersect(random.sample(sets,n))) for trial in range(trials)) def maxIntersection(sets,n,trials = 100000): k = guessSize(sets,n,trials) filteredSets = [s for s in sets if len(s) >= k] return largestIntersection(filteredSets,n) 

The idea is to first reduce the number of groups before trying to iterate over intersections. For instance:

 #stress test: import random nums = list(range(1,101)) testGroups = [set(random.sample(nums,random.randint(1,100))) for n in range(1000)] foundGroups = maxIntersection(testGroups,3) 

it only takes a few seconds to calculate foundGroups , not a few minutes if I used the largestIntersection(testGroups) directly. On the other hand, with various variants of random parameters, the time savings become negligible.

Further editing: Perhaps you can be more aggressive with filtering:

 import collections def maxIntersection(sets,n,trials = 100000): k = guessSize(sets,n,trials) filteredSets = [s for s in sets if len(s) >= k] counts = collections.Counter() for s in filteredSets: for x in s: counts[x] +=1 t = {x for x,c in counts.items() if c >= k} filteredSets = [s & t for s in filteredSets if len(s & t) >= k] return largestIntersection(filteredSets,n) 
+1
source

The first prototype approach / hack, combining the beauty of recursion, pseudo-functional programming and *** - for my part. There are many improvements, especially regarding iterators / lists. Perhaps it even qualifies as a spaghetti code :-).

Warning: see comment by @John Coleman regarding the binomial coefficient. We generate all possible subsets of the remaining values ​​at each iteration. This can be improved if the generators are used lazily (but will still be out of reach for huge sets of unique numbers).

 import itertools groups = [ [1], [1, 2 ,3], [1, 2, 3, 4, 5], [1, 2, 3 ,4], [2, 3], [4, 5, 6], [4, 5, 7] ] def solve(groups, N, sol=[]): if len(groups) == 0: return sol rem_vals = list(set(itertools.chain(*groups))) combs = list(itertools.product(range(2), repeat=len(rem_vals))) combs_ = [[rem_vals[ind] for ind, i in enumerate(combs[comb]) if i] for comb in range(len(combs))] for cand in reversed(sorted(combs_, key=lambda x: len(list(itertools.chain(x))))): if len(cand) == 0: continue else: counter = 0 inds = [] for ind, g in enumerate(groups): if set(cand).issubset(g): counter += 1 inds.append(ind) if counter >= N: sol.append(cand) for i in inds: for j in cand: groups[i].remove(j) return solve(groups, N, sol) return sol print(solve(groups, 3)) 

Output

 [[1, 2, 3], [4, 5]] 
+1
source

Source: https://habr.com/ru/post/1233397/


All Articles