Maintaining the smallest subset

Here are the operations that I would like to perform in a hypothetical data structure of a collection that contains many as its elements:

  • Insert the set into the data structure, but: (1) if the new set is a subset of any of the existing sets, do not add it (2) if the new set is a subset of any existing sets, delete them.
  • List all collections in the collection.

All considered sets are subsets of a known finite set, for example {0..10 ^ 4}.

Is there any way to do this efficiently?

+4
source share
6 answers

Here is a recent article on this issue: http://research.google.com/pubs/pub36974.html

In short, you cannot do much better than squared time in the worst case; but there are some tricks to speed it up in practice.

+1
source

All considered sets are subsets of a known finite set, for example {0..10 ^ 4}.

Let me call it N = 10 ^ 4. It is small enough and it will prove useful. Say you have S.

Logically means that you have an N * S matrix.

You will already have a set of sets. There are S-sets in this primary structure.

10 ^ 4 is small enough so that you can save a secondary data structure that stores for each value N the list of sets in which it is located. This structure is similar to transposing the primary structure. This can be a vector of length N, allowing you to search for a constant search of time to find a list of sets in which a certain value is located.

Now, when you add a new set, you can use this secondary structure to find which other sets are in each of their values. For example, we add a new set with values ​​of 2.5, 10

new_set = {2,5,10} 

The secondary structure tells us which sets they are in:

  2 : {A,B,D} 5 : {B,D} 10 : {B} 

We can combine and sort these three lists to get ABBBDD , which tells us not only what it imposes on it, but also about the size of the floors. Three nodes are shared with B, which means that our new set is a subset or equal to B. We share 1 node with A and two nodes with D. If it turns out that the total size of A is 1, then we now know that A is a subset our new multitude.

+1
source

Listing sets in a collection is easy, O (n). However, checking a new candidate to see if he will be a subset of all existing sets will be somewhat expensive. There are well-known algorithms for testing if one set is a subset of the other, so simple

 for each subset s in S for each candidate set C test of C is a subset of s if it is, break if never found, add C to S. 

It will be something like O (n ^ 2 log n). Is it considered "effective"?

0
source

Maintain a flowering filter for all saved sets. Create a color filter for the set you want to insert. If you are a bitwise AND filter for the set to be inserted (call it X) with a different flowering filter and get the value X back, then your set will be inserted. You can be a subset (possibly false positive, you need to check the slow path at this point). Otherwise, this is definitely not the case, and you can try with another.

There are many customizable options when creating a flowering filter that allows you to compromise between space efficiency and the likelihood of false positives.

http://en.wikipedia.org/wiki/Bloom_filter

0
source

For space efficiency, you can use the bit specified to represent each subset of a known finite set. There are also methods for representing sparse bit sets (see, for example, this Java sample ) to obtain additional space savings.

The general structure may be a set of bit sets. In Java, BitSet does not have a subset testing method, but I do not think it would be too difficult to extend BitSet to include an effective subset testing method. (This avoids the hideous task of checking whether the added candidate is equal to its intersection with any of the existing subsets.)

0
source

Use some kind of tree structure.

Eg. Save the sorted existing sets in Trie. A flag is supported on each node if the path leading to this node is an existing set

1 To check if a given set is a superset of an existing set:

 def issuperset(node, set[N], setc, N): if node.is_set: return True for j = setc:N if set[j] is a child of node: if issuperset(node.child[set[j]], set, j+1, N): return True return False 

2 Delete all supersets of the given set.

 def remsuperset(node, set[N], setc, N): if setc == N+1: remove_all_sets_on_or_below_this_node(node) return for ch in node.child: if ch< set[setc]: remsuperset(node.child[ch], set, setc, N) elif ch == set[setc]: remsuperset(node.child[ch], set, setc+1, N) 

3 To enumerate sets, simply cross the tree and the print path - this is_set flag is set to True

0
source

Source: https://habr.com/ru/post/1392808/


All Articles