Best way to find a subset of a set from a group of sets

First, sorry for the ambiguous name.

Suppose I have the following group of sets:

Group 1

s1 = ( x1, y1 ) s2 = ( x2 ) 

Group 2

 m1 = ( x1, y1, y2 ) m2 = ( x1 ) m3 = ( x1 , x2 ) 

For each of the sets from Group 1 - call the set s , I need to find the sets in Group 2 - call it m - so that m is a subset of s .

So, for my example, the answer will be as follows:

 s1 -> m2 s2 -> nothing 

I am currently storing the values ​​in std:set , but I can change this if necessary. In addition, sets can become large, so the algorithm must be efficient. At the moment, I have an approach with brute force, which I am not completely satisfied.

Any suggestions?

+6
source share
4 answers

The first step is to sort group 1 by capacity (i.e. size). Then the algorithm has an order of magnitude:

 foreach std::set M in "Group 2" { foreach std::set S in "Group 1" and S.size()>=M.size() { // replace with binary search if ( std::includes(S.begin(),S.end(),M.begin(),M.end()) ) { /* M is a subset of S */ } } } } 

This should have a time complexity of ~ O (MSR), where M is the number of sets in Group 2, S is the number of sets in Group 1, and R is the size of the largest set in Group # 1.

Edit: It just occurred to me that it would be more efficient to use S.find() instead of calling std::includes() (which iterates sequentially), but I think this would only be true if M. size () is much smaller than S.size () - O (M + S) vs O (MlogS).

+1
source

You do not know how rude your approach is. As long as you use the specified query functions in std :: namespace, they are likely to be as efficient as possible. For example, if set_intersection (s1.begin (), s2.end (), m1.begin (), m1.end ()) is equivalent to m1.

You could be more efficient than this, since you do not want to copy the corresponding elements, just to know that they all appear. This can be done by copying the set_intersection code, but changing the implementation to just count the number of matching elements, rather than copying them. Then, if the counter matches the size m, you will get a match.

Regarding containers, I often prefer to sort deque by set for large collections. Memory is much less distributed over the heap, which helps with caching. It also avoids the overhead of the base tree. This is especially useful when containers are created once, but run multiple times.

0
source

Are your sets often modified or read-only / mostly?

  • If changed frequently, std::set represents the perfect balance between modification and sorting performance.
  • If only reading or reading is used, you can use the sorted std::vector . Sorting is expensive, but actually cheaper than creating the whole tree in std::set , so performance is better if you do it rarely enough.

Once you have created the sorted containers (whether they are "automatically sorted" by std::set or manually sorted by std::vector ), you can test the subset with std::includes . BTW, if you need to find the right subsets, you can compare the number of elements after that.

0
source

You can try something like this. Steps:

  • create an array containing all the objects in both groups
  • convert all s and m to a bitmap, where array (i) = 1 if the set contains object (i), 0 otherwise
  • m (k) is a subset of s (j) if m (k) AND s (j) = m (k)
0
source

Source: https://habr.com/ru/post/908543/


All Articles