An algorithm for efficiently selecting rows from a matrix, so the resulting column values ​​are equal

The practical application of this problem is a group task in the study of psychology, but the theoretical formulation is as follows:

I have a matrix (the actual matrix is ​​27x72, but I will take a 4x8 example):

1 0 1 0 0 1 0 1 1 1 0 0 0 1 1 0 0 0 1 1 1 0 1 0 1 1 0 0 0 1 0 1 

I want to extract half of the rows from this matrix so that the resulting column values ​​are equal (thus effectively creating two matrices with equivalent column sums). I cannot reorder the values ​​inside the rows.

I tried some brute force solutions, but my matrix is ​​too large for it to be effective, even choosing some random constraints first. It seems to me that the search space may be limited by the best algorithm, but so far I have not been able to think of one thing. Any ideas? It is also possible that there is no solution, so the algorithm would have to deal with this. I worked in R, but I could easily switch to python.

Update Found a solution thanks to ljeabmreosn. Karmarkar-Karp did a great job on the algorithm, and the conversion of strings to base 73 was inspired. It was very difficult for me to find a code that would actually give me subsequences, and not just the final difference (maybe most people are only interested in this problem in the abstract?). In any case, it was the code:

First I converted my lines to base 73 as a suggested poster. For this, I used the basic package in python, defining an alphabet with 73 characters, and then using the basein.decode function to convert to decimel.

For the algorithm, I just added code for printing subsequence indices from this mailing list message from Tim Peters: https://mail.python.org/pipermail/tutor/2001-August/008098.html

 from __future__ import nested_scopes import sys import bisect class _Num: def __init__(self, value, index): self.value = value self.i = index def __lt__(self, other): return self.value < other.value # This implements the Karmarkar-Karp heuristic for partitioning a set # in two, ie into two disjoint subsets st their sums are # approximately equal. It produces only one result, in O(N*log N) # time. A remarkable property is that it loves large sets: in # general, the more numbers you feed it, the better it does. class Partition: def __init__(self, nums): self.nums = nums sorted = [_Num(nums[i], i) for i in range(len(nums))] sorted.sort() self.sorted = sorted def run(self): sorted = self.sorted[:] N = len(sorted) connections = [[] for i in range(N)] while len(sorted) > 1: bigger = sorted.pop() smaller = sorted.pop() # Force these into different sets, by "drawing a # line" connecting them. i, j = bigger.i, smaller.i connections[i].append(j) connections[j].append(i) diff = bigger.value - smaller.value assert diff >= 0 bisect.insort(sorted, _Num(diff, i)) # Now sorted contains only 1 element x, and x.value is # the difference between the subsets' sums. # Theorem: The connections matrix represents a spanning tree # on the set of index nodes, and any tree can be 2-colored. # 2-color this one (with "colors" 0 and 1). index2color = [None] * N def color(i, c): if index2color[i] is not None: assert index2color[i] == c return index2color[i] = c for j in connections[i]: color(j, 1-c) color(0, 0) # Partition the indices by their colors. subsets = [[], []] for i in range(N): subsets[index2color[i]].append(i) return subsets if not sys.argv: print "error no arguments provided" elif sys.argv[1]: f = open(sys.argv[1], "r") x = [int(line.strip()) for line in f] N = 50 import math p = Partition(x) s, t = p.run() sum1 = 0L sum2 = 0L for i in s: sum1 += x[i] for i in t: sum2 += x[i] print "Set 1:" print s print "Set 2:" print t print "Set 1 sum", repr(sum1) print "Set 2 sum", repr(sum2) print "difference", repr(abs(sum1 - sum2)) 

This gives the following result:

 Set 1: [0, 3, 5, 6, 9, 10, 12, 15, 17, 19, 21, 22, 24, 26, 28, 31, 32, 34, 36, 38, 41, 43, 45, 47, 48, 51, 53, 54, 56, 59, 61, 62, 65, 66, 68, 71] Set 2: [1, 2, 4, 7, 8, 11, 13, 14, 16, 18, 20, 23, 25, 27, 29, 30, 33, 35, 37, 39, 40, 42, 44, 46, 49, 50, 52, 55, 57, 58, 60, 63, 64, 67, 69, 70] Set 1 sum 30309344369339288555041174435706422018348623853211009172L Set 2 sum 30309344369339288555041174435706422018348623853211009172L difference 0L 

Which provides indices of its own subsets in a few seconds. Thanks everyone!

+5
source share
1 answer

Assuming that each entry in the matrix can be 0 or 1, this problem seems to be in the same family as the Partition Problem , which has only a pseudopolynomial time algorithm. Let r be the number of rows in the matrix and c number of columns in the matrix. Then encode each line to a c -digit base number r+1 . This is done to ensure that when adding each encoding, there is no need to transfer, so the equivalent numbers in this database will be equal to two sets of rows whose column sums are equivalent. Thus, in your example, you will convert each line to a 4-digit base number 9. This will give numbers (converted to base 10):

1010 9 => 738 10

0101 9 => 82 10

1100 9 => 810 10

0110 9 => 90 10

0011 9 => 10 10

1010 9 => 738 10

1100 9 => 810 10

0101 9 => 82 10

Although you probably couldn't use the pseudo-polynomial time algorithm with this method, you can use simple heuristics with some decision trees to try to speed things up with brute force. Using the numbers above, you can try to use the Karmarkar-Karp heuristic . The following is the first step of the algorithm in Python 3:

 # Sorted (descending) => 810, 810, 738, 738, 90, 82, 82, 10 from queue import PriorityQueue def karmarkar_karp_partition(arr): pqueue = PriorityQueue() for e in arr: pqueue.put_nowait((-e, e)) for _ in range(len(arr)-1): _, first = pqueue.get_nowait() _, second = pqueue.get_nowait() diff = first - second pqueue.put_nowait((-diff, diff)) return pqueue.get_nowait()[1] 

The algorithm is fully implemented here. Please note that this method is simply heuristic and may not find a better section.

+2
source

Source: https://habr.com/ru/post/1270236/


All Articles