Matrix median with sorted rows

I cannot solve the following problem optimally and cannot find an approach to do this anywhere.

Given the N × M matrix in which each row is sorted, find the common median of the matrix. Suppose that N * M is odd.

For instance,

Matrix =
[1, 3, 5]
[2, 6, 9]
[3, 6, 9]

A = [1, 2, 3, 3, 5, 6, 6, 9, 9]

The median is 5. So, we return 5.
Note: additional memory is not allowed.

Any help would be appreciated.

+6
source share
6 answers

Consider the following process.

  • If we consider the N * M matrix as a 1-D array, then the median is an element of the 1+N*M/2 element.

  • Then we consider that x will be median if x is a matrix element and the number of matrix elements ≤ x is 1 + N*M/2 .

  • When the matrix elements in each row are sorted, you can easily find the number of elements in each row less than or equals x . To find the whole matrix, the complexity is N*log M with binary search.

  • Then first find the minimum and maximum elements from the N * M matrix. Apply a binary search in this range and run the above function for each x.

  • If the number of elements in the matrix ≤ x is 1 + N*M/2 , and x contains in this matrix, then x is the median.

You can consider this C ++ code below:

 int median(vector<vector<int> > &A) { int min = A[0][0], max = A[0][0]; int n = A.size(), m = A[0].size(); for (int i = 0; i < n; ++i) { if (A[i][0] < min) min = A[i][0]; if (A[i][m-1] > max) max = A[i][m-1]; } int element = (n * m + 1) / 2; while (min < max) { int mid = min + (max - min) / 2; int cnt = 0; for (int i = 0; i < n; ++i) cnt += upper_bound(&A[i][0], &A[i][m], mid) - &A[i][0]; if (cnt < element) min = mid + 1; else max = mid; } return min; } 
+9
source

A simple solution to O (1) memory is to check whether each individual z element is median. To do this, we find the position of z in all rows, simply accumulating the number of elements smaller than z. This does not use the fact that each line is sorted, except for finding the position z in each line in O (log M) time. For each element, we need to perform N * log M comparisons, and there are N * M elements, so this is N²M log M.

+1
source

If the matrix elements are integers, you can binary search for the median, starting with the matrix range for hi and low. O (n log m log (hi-low)).

Otherwise, one of the methods that has O (n? Log2? M) time complexity is a binary search, O (log m), for each line, in turn, O (n), the closest element to the common matrix is ​​the median on the left and the closest on the right, O (n log m), updating the best so far. We know that the total median has no more than floor(m * n / 2) elements strictly smaller than it, and that adding the number of elements is less than it and the number of times it has can be no less than floor(m * n / 2) + 1 . We use a standard binary search in a string, and as greybeard pointed out, skip the test for items outside our "best" range. The test for how close an element is to the common median involves counting the number of elements in each row strictly less, and how many equal, which is achieved in O(n log m) time with binary searches n . Since the string is sorted, we know that higher elements will be more “right” and smaller elements “left” with respect to the common median.

If it is allowed to reorder the matrix, then the time complexity of O (mn log (mn)) is possible by sorting the matrix in place (for example, by sorting the blocks) and returning the middle element.

+1
source

There is a randomized algorithm that solves this problem in O (n (log n) (log m)) time. This is a Las Vegas algorithm , which means that it always gives the correct results, but may take longer than expected. In this case, the likelihood that it takes much longer than expected is extremely small.

For m = 1, this problem boils down to the task of finding the median in a read-only array using constant space. This problem does not have a known optimal solution: see "Finding a median in read-only memory based on integer input, Chan et al."

One thing related to this reduction of the problem for m = 1 is that this case sub is also super , since the algorithm for m = 1 can be applied to case m> 1. The idea is to just forget that the lines arrays are sorted and process the entire storage area as an unsorted array of size n * m. So, for example, the trivial algorithm for the case m = 1, in which each element is checked whether it is median, takes O (n 2 ) time. Applying it when m> 1 takes O (n 2 m 2 ) time.

Returning to the case m = 1, in the comparison model (in which the elements of the array can be integers, strings, real numbers or anything else that can be compared with the inequality operators "<",> "), the most well-known deterministic solution using the space s (where s is a constant, i.e., in O (1)) has the time Θ (2 s s! n 1 + 1 / s ), and this is more complicated than the usual algorithms discussed in stackoverflow (although not on https://cstheory.stackexchange.com or https://cs.stackexchange.com ). It uses the algorithm chain A s , A s-1 , ..., A 1 , where A s + 1 calls A s . You can read it in "Choice from read-only memory and sorting with minimal data movement, "Munroe and Raman .

There is a simple randomized algorithm with shorter execution time with a high probability. For any constant c, this algorithm runs in time O (n log n) with probability 1 - O (n -c ). When the array is an n * m sized matrix that works with O (nm log (nm)).

This algorithm is very similar to quickselect without rearranging elements during splitting.

 import random def index_range(needle, haystack): """The index range' of a value over an array is a pair consisting of the number of elements in the array less than that value and the number of elements in the array less than or equal to the value. """ less = same = 0 for x in haystack: if x < needle: less += 1 elif x == needle: same += 1 return less, less + same def median(xs): """Finds the median of xs using O(1) extra space. Does not alter xs. """ if not xs: return None # First, find the minimum and maximum of the array and # their index ranges: lo, hi = min(xs), max(xs) lo_begin, lo_end = index_range(lo, xs) hi_begin, hi_end = index_range(hi, xs) # Gradually we will move the lo and hi index ranges closer # to the median. mid_idx = len(xs)//2 while True: print "range size", hi_begin - lo_end if lo_begin <= mid_idx < lo_end: return lo if hi_begin <= mid_idx < hi_end: return hi assert hi_begin - lo_end > 0 # Loop over the array, inspecting each item between lo # and hi. This loops sole purpose is to reservoir sample # from that set. This makes res a randomly selected # element from among those strictly between lo and hi in # xs: res_size = 0 res = None for x in xs: if lo < x < hi: res_size += 1 if 1 == random.randint(1, res_size): res = x assert res is not None assert hi_begin - lo_end == res_size # Now find which size of the median res is on and # continue the search on the smaller region: res_begin, res_end = index_range(res, xs) if res_end > mid_idx: hi, hi_begin, hi_end = res, res_begin, res_end else: lo, lo_begin, lo_end = res, res_begin, res_end 

It works by preserving the upper and lower limits of the median value. He then iterates over the array and randomly selects a value between the borders. This value replaces one of the boundaries, and the process begins again.

Estimates are accompanied by a range of their indices, the measure of which indices will be displayed when sorting the array. As soon as one of the boundaries appears in the index ⌊n / 2⌋, this is the median and the algorithm ends.

When an element is randomly selected between the borders, this reduces the gap by 50% in anticipation. The algorithm ends (no later) when the gap is 0. We can model this as a series of random independent uniformly distributed variables X i from (0,1) such that Y k = X 1 * X 2 * ... * X k , where X i is the space ratio that remains after round i. For example, if after the 10th round, the gap between the ranges of the indices lo and hi is 120, and after the 11th round, the gap is 90, then X 11 = 0.75. The algorithm ends when Y k <1 / n, since the gap is then less than 1.

Choose a constant positive integer k. Let the probability that Y k log 2 n > = 1 / n be connected using the Chernov bounds. We have Y k log 2 n = X 1 * X 2 * ... X k log 2 n , therefore ln Y k log 2 n = ln X 1 + ln X 2 + ... + ln X k log 2 n . Then the Chernov boundary gives Pr (ln X 1 + ln X 2 + ... + ln X k log 2 n > = ln (1 / n)) <= min t> 0 e -t ln (1 / n) (E [e t ln X 1 ] * E [e t ln X 2 ] * ... * E [e t ln X k log 2 n ]). After some simplification, the right-hand side is min t> 0 n t (E [X 1 t ] * E [X 2 t ] * ... * E [X k log 2 n t ]). Since this is a minimum and we are looking for an upper bound, we can weaken this by specializing in t = 1. Then it simplifies to n 1-k since E [X i ] = 1/2.

If we choose, for example, k = 6, then this limits the probability that there are 6 log 2 n rounds or more by n -5 . Thus, with a probability of 1 - O (n -5 ), the algorithm performs 6 log 2 n - 1 or fewer rounds. This is what I mean by "high probability" above.

Since each round checks each member of the array a constant number of times, each round takes linear time, for the full run time O (n log n) with a high probability. When an array is not just an array, but an n * m matrix that works with O (nm log (nm)).

However, we can do much better by taking advantage of string sorting. When we were working in one unsorted array, searching for elements in a space, which I referred to above, required checking each element of the array. In a matrix with sorted rows, the elements in the gap are located in the adjacent segment of each row. Each segment can be identified in O (log m) time using binary search, so all of them can be located in O (n log m) time. Collector sampling now takes O (n log m) time for loop iteration.

Another major work performed in the loop is to identify the index range of an element from a gap that was randomly selected. Again, since each row is sorted, the range of indices for a randomly selected item in a row can be specified in O (log m) time. The sums of the index ranges for each row make up the index range over the entire array, so this part of each iteration of the loop also takes only O (n log m) time.

With the same argument as above, with the Chernov boundary, iterations O (log n) exist with a probability of at least 1-O (n -k ) for any constant k. Thus, the entire algorithm with high probability takes O (n (log n) (log m)) time.

 import bisect import random def matrix_index_range(needle, haystack): """matrix_index_range calculates the index range of needle in a haystack that is a matrix (stored in row-major order) in which each row is sorted""" n, m = len(haystack), len(haystack[0]) begin = end = 0; for x in haystack: begin += bisect.bisect_left(x, needle) end += bisect.bisect_right(x, needle) return begin, end def matrix_median(xs): print "Starting" if not xs or not xs[0]: return None n, m = len(xs), len(xs[0]) lo, hi = xs[0][0], xs[0][m-1] for x in xs: lo, hi = min(lo, x[0]), max(hi, x[m-1]) lo_begin, lo_end = matrix_index_range(lo, xs) hi_begin, hi_end = matrix_index_range(hi, xs) mid_idx = (n * m) // 2 while True: print "range size", hi_begin - lo_end if lo_begin <= mid_idx < lo_end: return lo if hi_begin <= mid_idx < hi_end: return hi assert hi_begin - lo_end > 0 mid = None midth = random.randint(0, hi_begin - lo_end - 1) for x in xs: gap_begin = bisect.bisect_right(x, lo) gap_end = bisect.bisect_left(x, hi) gap_size = gap_end - gap_begin if midth < gap_size: mid = x[gap_begin + midth] break midth -= gap_size assert mid is not None mid_begin, mid_end = matrix_index_range(mid, xs) assert lo_end <= mid_begin and mid_end <= hi_begin if mid_end > mid_idx: hi, hi_begin, hi_end = mid, mid_begin, mid_end else: lo, lo_begin, lo_end = mid, mid_begin, mid_end 

This solution is significantly faster than the first when m is inconsistent.

+1
source

I coded a time solution O (n 2 log 2 m) גלעד ברקן, but they asked me not to add code to their answer, so here it is as a separate answer:

 import bisect def MedianDistance(key, matrix): lo = hi = 0 for row in matrix: lo += bisect.bisect_left(row, key) hi += bisect.bisect_right(row, key) mid = len(matrix) * len(matrix[0]) // 2; if hi - 1 < mid: return hi - 1 - mid if lo > mid: return lo - mid return 0 def ZeroInSorted(row, measure): lo, hi = -1, len(row) while hi - lo > 1: mid = (lo + hi) // 2 ans = measure(row[mid]) if ans < 0: lo = mid elif ans == 0: return mid else: hi = mid def MatrixMedian(matrix): measure = lambda x: MedianDistance(x, matrix) for idx, row in enumerate(matrix): if not idx & idx-1: print(idx) ans = ZeroInSorted(row, measure) if ans is not None: return row[ans] 
+1
source

sunkuet02 answer with refinements and python code:
Each row of the N × M A matrix is ​​sorted and has a middle element, which is its median.
There are at least N * (M + 1) / 2 elements not exceeding the maximum hi of these medians, and at least N * (M + 1) / 2 not less than the minimum lo:
the median of all elements of A must be between lo and hi, inclusive.
As you know, more than half of the elements are known to be lower than the current candidate, the latter, as you know, is high. As soon as there are too few lines to count the elements below the current candidate, to reach half the total number, the candidate is known to be low: in both cases, immediately proceed to the next candidate.

 from bisect import bisect def median(A): """ returns the median of all elements in A. Each row of A needs to be in ascending order. """ # overall median is between min and max row median lo, hi = minimax(A) n = len(A) middle_row = n // 2 columns = len(A[0]) half = (n * columns + 1) // 2 while lo < hi: mid = lo + (hi - lo) // 2 lower = 0 # first half can't decide median for a in A[:middle_row]: lower += bisect(a, mid) # break as soon as mid is known to be too high or low for r, a in enumerate(A[middle_row:n-1]): lower += bisect(a, mid) if half <= lower: hi = mid break if lower < r*columns: lo = mid + 1 break else: # decision in last row lower += bisect(A[n-1], mid) if half <= lower: hi = mid else: lo = mid + 1 return lo def minmax(x, y): """return min(x, y), max(x, y)""" if x < y: return x, y return y, x def minimax(A): """ return min(A[0..m][n//2]), max(A[0..m][n//2]): minimum and maximum of medians if A is a row major matrix with sorted rows.""" n = len(A) half = n // 2 if n % 2: lo = hi = A[0][half] else: lo, hi = minmax(A[0][half], A[1][half]) for i in range(2-n % 2, len(A[0]), 2): l, h = minmax(A[i][half], A[i+1][half]) if l < lo: lo = l if hi< h: hi = h return lo, hi if __name__ =='__main__': print(median( [[1, 3, 5], [2, 6, 9], [3, 6, 9]] )) 

(I believe that std::upper_bound() and bisect.bisect() equivalent ( bisect_right() is an alias).) For the second median candidate, the last processed line may be lower than at the first iteration. In the following iterations, this serial number should never decrease - too lazy to affect the fact that ((rename and) increase middle_row as necessary).

0
source

Source: https://habr.com/ru/post/1013680/


All Articles