Delete sets that are subsets

Suppose you are given two lists of sets of integers. Call them A and B. Each set of A can be contained in one of the sets in B. What is the most efficient algorithm for finding all elements in such that none of them are contained in set B?

+5
source share
1 answer

To try to crop the search space, we can try some checks and filters as part of the search. To take your example in the comments,

A = [{1,2},{1,4},{3,4,7}] B = [{2,3,4},{1,2,4},{1,2,5,6}] 

start by listing the O (n) unique elements in the sets as keys pointing to the cross-correlated indices of the sets to which they belong:

 1: A: {0,1}, B: {1,2} 2: A: {0} , B: {0,1,2} 3: A: {2} , B: {0} 4: A: {1,2}, B: {0,1} 5: A: {} , B: {2} 6: A: {} , B: {2} 7: A: {2} , B: {} 

We can immediately distinguish sets in A containing an element not found in B, such as the Third set.

Now we will go through each set from A that has not been excluded, and make sure that there is a corresponding complete intersection of at least one set from B. Since the number of set indices in your case is in millions, instead of initially going around B as a whole, we will divide our research B into sections, each of the sets of k , say 1024. In addition, to represent these 1024 sets of pointers, we will divide them into 16 bits of 64 bits each so that we can bitwise AND (&) one with the other.

At any time during this detour, we benefit from an early exit if the AND operation results in zero:

 set A[0] => elements 1,2 => set-index-intersection in B: b110 & b111 => match set A[1] => elements 1,4 => set-index-intersection in B: b110 & b11 => match 

In general, working in sections, we will consider about 10 * 16 operations to check whether one of A is included in one of the sets in the current section of k B-sets. In other words, we reduced the number of operations from 10,000,000 to 160,000 to fully test one set in (per one million sets in B). This is a ratio of 62.

+1
source

Source: https://habr.com/ru/post/1265030/


All Articles