The most common subset of size k

Suppose you have a list of subsets S1,...,Sn an integer range R={1,2,...,N} and an integer k . Is there an efficient way to find a subset C of R size k , so that C is a subset of the maximum number Si ?

As an example, let R={1,2,3,4} and k=2

 S1={1,2,3} S2={1,2,3} S3={1,2,4} S4={1,3,4} 

Then I want to return either C={1,2} or C={1,3} (it doesn't matter which one).

+6
source share
3 answers

I think your problem is NP-Hard. Consider a bipartite graph with left nodes that are your collections, and right nodes with integers {1, ..., N} , with an edge between two nodes if the collection contains an integer. Then to find a common subset of size k , which is a subset of the maximum number Si , is equivalent to finding a complete bipartite subgraph K(i, k) with the maximum number of edges i*k . If you could do it in polynomial time, then you could find a complete bipartite subgraph K(i, j) with the maximum number of edges i*j in polynomial time, trying for each fixed k . But this problem is in NP-Complete.

So, if P = NP, your problem does not have a polynomial time algorithm.

+2
source

Assuming I understand your question, I believe that this is true for fairly small sets.

I use Mathematica code for illustration, but the concept is universal.

I generate 10 random subsets of length 4 from the set {1 .. 8}:

 ss = Subsets[ Range@8 , {4}] ~RandomSample~ 10 
 {{1, 3, 4, 6}, {2, 6, 7, 8}, {3, 5, 6, 7}, {2, 4, 6, 7}, {1, 4, 5, 8}, {2, 4, 6, 8}, {1, 2, 3, 8}, {1, 6, 7, 8}, {1, 2, 4, 7}, {1, 2, 5, 7}} 

I convert them to a binary array from the presence of each number in each subset:

 a = Normal@SparseArray [Join @@ MapIndexed[Tuples[{##}] &, ss] -> 1]; Grid[a] 

Mathematica graphics

These are ten columns for ten subsets and eight rows for elements {1 .. 8}.

Now create all possible target subsets (size 3 ):

 keys = Subsets[Union @@ ss, {3}]; 

Take the “key” and extract these lines from the array and perform the BitAnd operation (return 1 if all columns are 1 ), and then count the number of units. For example, for the key {1, 6, 8} we have:

 a[[{1, 6, 8}]] 

Mathematica graphics

After BitAnd:

Mathematica graphics

Do this for each key:

 counts = Tr[BitAnd @@ a[[#]]] & /@ keys; 

Then find the position (s) of the maximum element of this list and extract the corresponding parts of keys :

 keys ~Extract~ Position[counts, Max@counts ] 
 {{1, 2, 7}, {2, 4, 6}, {2, 4, 7}, {2, 6, 7}, {2, 6, 8}, {6, 7, 8}} 

With sufficient memory, this process works quickly for a larger set. Starting from 50,000 randomly selected subsets of length 7 from {1 .. 30}:

 ss = Subsets[ Range@30 , {7}] ~RandomSample~ 50000; 

The maximum subsets of length 4 are calculated in about nine seconds:

 AbsoluteTiming[ a = Normal@SparseArray [Join @@ MapIndexed[Tuples[{##}] &, ss] -> 1]; keys = Subsets[Union @@ ss, {4}]; counts = Tr[BitAnd @@ a[[#]]] & /@ keys; keys~Extract~Position[counts, Max@counts ] ] 
  {8.8205045, {{2, 3, 4, 20}, {7, 10, 15, 18}, {7, 13, 16, 26}, {11, 21, 26, 28}}} 

I must add that Mathematica is a high-level language, and these operations are performed on shared objects, so if it is really done on a binary level, it should be much faster and have more memory efficiency.

+1
source

I hope I do not understand the problem ... Here is the solution in SWI-Prolog

 :- module(subsets, [solve/0]). :- [library(pairs), library(aggregate)]. solve :- problem(R, K, Subsets), once(subset_of_maximal_number(R, K, Subsets, Subset)), writeln(Subset). problem(4, 2, [[1,2,3], [1,2,3], [1,2,4], [1,3,4]]). problem(8, 3, [[1, 3, 4, 6], [2, 6, 7, 8], [3, 5, 6, 7], [2, 4, 6, 7], [1, 4, 5, 8], [2, 4, 6, 8], [1, 2, 3, 8], [1, 6, 7, 8], [1, 2, 4, 7], [1, 2, 5, 7]]). subset_of_maximal_number(R, K, Subsets, Subset) :- flatten(Subsets, Numbers), findall(Num-Count, ( between(1, R, Num), aggregate_all(count, member(Num, Numbers), Count) ), NumToCount), transpose_pairs(NumToCount, CountToNumSortedR), reverse(CountToNumSortedR, CountToNumSorted), length(Subset, K), % list of free vars prefix(SolutionsK, CountToNumSorted), pairs_values(SolutionsK, Subset). 

test output:

 ?- solve. [1,3] true ; [7,6,2] true. 

edit: I think the above solution is wrong, in the sense that the return could not be a subset of any of the input: here is a (commented) solution without this problem:

 :- module(subsets, [solve/0]). :- [library(pairs), library(aggregate), library(ordsets)]. solve :- problem(R, K, Subsets), once(subset_of_maximal_number(R, K, Subsets, Subset)), writeln(Subset). problem(4, 2, [[1,2,3], [1,2,3], [1,2,4], [1,3,4]]). problem(8, 3, [[1, 3, 4, 6], [2, 6, 7, 8], [3, 5, 6, 7], [2, 4, 6, 7], [1, 4, 5, 8], [2, 4, 6, 8], [1, 2, 3, 8], [1, 6, 7, 8], [1, 2, 4, 7], [1, 2, 5, 7]]). subset_of_maximal_number(R, K, Subsets, Subset) :- flatten(Subsets, Numbers), findall(Num-Count, ( between(1, R, Num), aggregate_all(count, member(Num, Numbers), Count) ), NumToCount), % actually sort by ascending # of occurrences transpose_pairs(NumToCount, CountToNumSorted), pairs_values(CountToNumSorted, PreferredRev), % we need higher values first reverse(PreferredRev, Preferred), % empty slots to fill, preferred first length(SubsetP, K), select_k(Preferred, SubsetP), % verify our selection it an actual subset of any of subsets sort(SubsetP, Subset), once((member(S, Subsets), ord_subtract(Subset, S, []))). select_k(_Subset, []). select_k(Subset, [E|R]) :- select(E, Subset, WithoutE), select_k(WithoutE, R). 

Test:

 ?- solve. [1,3] true ; [2,6,7] true. 
+1
source

Source: https://habr.com/ru/post/908700/


All Articles