There is a randomized algorithm that solves this problem in O(n (log n)(log m)) time. It is a Las Vegas algorithm, meaning it always gives the correct result but may take longer than expected. In this case, the probability that it takes much longer than expected is extremely small.
For m = 1, this problem boils down to finding the median of a read-only array using constant space. That problem has no known optimal solution; see "Finding a Median in Read-only Memory on Integer Input" by Chan et al.
One interesting aspect of this reduction is that the case m = 1, although a special case, is also fully general, since an algorithm for m = 1 can be applied when m > 1. The idea is simply to forget that the rows are sorted and treat the entire storage area as one unsorted array of size n * m. For example, the trivial algorithm for m = 1, which checks each element to see whether it is the median, takes O(n^2) time; applying it when m > 1 takes O(n^2 m^2) time.
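As a concrete illustration (a minimal sketch I am adding here, not code from the original answer), the trivial O(n^2) algorithm for m = 1 can be written as follows:

```python
def trivial_median(xs):
    """Trivial O(n^2)-time, O(1)-space median of a read-only array:
    test each element to see whether it is the (lower) median by
    counting how many elements are smaller and how many are equal."""
    n = len(xs)
    for candidate in xs:
        less = same = 0
        for x in xs:
            if x < candidate:
                less += 1
            elif x == candidate:
                same += 1
        # candidate is the median iff position n // 2 of the sorted
        # order falls within candidate's range of positions.
        if less <= n // 2 < less + same:
            return candidate
    return None
```

To apply it when m > 1, treat the matrix as a flat sequence, e.g. `trivial_median([v for row in matrix for v in row])`.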
Returning to the case m = 1: in the comparison model (in which the array elements may be integers, strings, real numbers, or anything else that can be compared with the inequality operators "<" and ">"), the best known deterministic solution using space s (where s is a constant, i.e., O(1)) runs in time Θ(2^s s! n^{1 + 1/s}), and it is more complicated than the usual algorithms discussed on Stack Overflow (though not those on https://cstheory.stackexchange.com or https://cs.stackexchange.com). It uses a chain of algorithms A_s, A_{s-1}, ..., A_1, where A_{s+1} calls A_s. You can read about it in "Selection from Read-only Memory and Sorting with Minimum Data Movement" by Munro and Raman.
There is a simple randomized algorithm with a shorter running time with high probability. For any constant c, this algorithm runs in O(n log n) time with probability 1 - O(n^{-c}). When the input is an n * m matrix, this becomes O(nm log(nm)).
This algorithm is very similar to quickselect, but without rearranging elements during partitioning.
```python
import random

def index_range(needle, haystack):
    """The "index range" of a value over an array is a pair consisting
    of the number of elements in the array less than that value and the
    number of elements in the array less than or equal to the value."""
    less = same = 0
    for x in haystack:
        if x < needle:
            less += 1
        elif x == needle:
            same += 1
    return less, less + same

def median(xs):
    """Finds the median of xs using O(1) extra space. Does not alter xs."""
    if not xs:
        return None
    n = len(xs)
    lo, hi = min(xs), max(xs)
    lo_begin, lo_end = index_range(lo, xs)
    hi_begin, hi_end = index_range(hi, xs)
    mid_idx = n // 2
    while True:
        if lo_begin <= mid_idx < lo_end:
            return lo
        if hi_begin <= mid_idx < hi_end:
            return hi
        # Select a uniformly random element strictly between lo and hi.
        midth = random.randint(0, hi_begin - lo_end - 1)
        mid = None
        for x in xs:
            if lo < x < hi:
                if midth == 0:
                    mid = x
                    break
                midth -= 1
        mid_begin, mid_end = index_range(mid, xs)
        # Replace whichever bound keeps the median between lo and hi.
        if mid_end > mid_idx:
            hi, hi_begin, hi_end = mid, mid_begin, mid_end
        else:
            lo, lo_begin, lo_end = mid, mid_begin, mid_end
```
It works by maintaining lower and upper bounds on the median value. It then iterates over the array and selects a random value between the bounds. This value replaces one of the bounds, and the process starts again.
Each bound is tracked together with its index range: the range of positions it would occupy if the array were sorted. As soon as position ⌊n/2⌋ falls within the index range of one of the bounds, that bound is the median and the algorithm terminates.
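To make the termination condition concrete, here is a worked example (the array values are made up; the `index_range` helper from the code above is repeated so the snippet runs standalone):

```python
def index_range(needle, haystack):
    """Returns (number of elements < needle, number of elements <= needle)."""
    less = same = 0
    for x in haystack:
        if x < needle:
            less += 1
        elif x == needle:
            same += 1
    return less, less + same

xs = [7, 1, 4, 4, 9]             # sorted order: [1, 4, 4, 7, 9]
begin, end = index_range(4, xs)  # (1, 3): 4 would occupy positions 1 and 2
# Position n // 2 = 2 lies in [1, 3), so 4 is the median.
assert begin <= len(xs) // 2 < end
```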
When an element is selected uniformly at random from between the bounds, the gap shrinks by 50% in expectation. The algorithm terminates (at the latest) when the gap reaches 0. We can model this as a sequence of independent random variables X_i, each uniformly distributed on (0, 1), with Y_k = X_1 * X_2 * ... * X_k, where X_i is the fraction of the gap that remains after round i. For example, if after round 10 the gap between the index ranges of lo and hi is 120, and after round 11 the gap is 90, then X_11 = 0.75. The algorithm terminates when Y_k < 1/n, since the gap is then less than 1.
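A small simulation (a hypothetical harness I am adding, not part of the original answer) confirms that the number of rounds grows like log n: each round multiplies the gap by a uniform random fraction, and the process stops once the gap drops below 1/n.

```python
import random

def average_rounds(n, trials=1000, seed=0):
    """Simulates Y_k = X_1 * X_2 * ... * X_k with X_i uniform on (0, 1),
    stopping when Y_k < 1/n; returns the mean number of rounds."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        gap, rounds = 1.0, 0
        while gap >= 1.0 / n:
            gap *= rng.random()
            rounds += 1
        total += rounds
    return total / trials
```

Since E[-ln X_i] = 1, the mean number of rounds is about ln n + 1; for n = 1024 that is roughly 8.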
Choose a positive integer constant k. We bound the probability that Y_{k log2 n} >= 1/n using a Chernoff bound. We have Y_{k log2 n} = X_1 * X_2 * ... * X_{k log2 n}, so ln Y_{k log2 n} = ln X_1 + ln X_2 + ... + ln X_{k log2 n}. The Chernoff bound gives Pr(ln X_1 + ln X_2 + ... + ln X_{k log2 n} >= ln(1/n)) <= min_{t>0} e^{-t ln(1/n)} (E[e^{t ln X_1}] * E[e^{t ln X_2}] * ... * E[e^{t ln X_{k log2 n}}]). After some simplification, the right-hand side is min_{t>0} n^t (E[X_1^t] * E[X_2^t] * ... * E[X_{k log2 n}^t]). Since this is a minimum and we want an upper bound, we can weaken it by specializing to t = 1. It then simplifies to n^{1-k}, since E[X_i] = 1/2.
If we choose, for example, k = 6, this bounds the probability that there are 6 log2 n rounds or more by n^{-5}. Thus, with probability 1 - O(n^{-5}), the algorithm performs at most 6 log2 n - 1 rounds. This is what I mean by "high probability" above.
Since each round examines each member of the array a constant number of times, each round takes linear time, for a total running time of O(n log n) with high probability. When the input is not a plain array but an n * m matrix, this becomes O(nm log(nm)).
However, we can do much better by taking advantage of the sorted rows. When we were working with a single unsorted array, finding the elements in the gap mentioned above required examining every element of the array. In a matrix with sorted rows, the elements in the gap form a contiguous segment of each row. Each segment can be located in O(log m) time by binary search, so all of them can be located in O(n log m) time. Sampling from the gap now takes O(n log m) time per loop iteration.
The other major work performed in the loop is determining the index range of the element randomly selected from the gap. Again, because each row is sorted, the index range of the selected element within a row can be computed in O(log m) time. Summing the per-row index ranges gives the index range over the entire array, so this part of each loop iteration also takes only O(n log m) time.
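The two binary-search steps for a single row can be sketched as follows (the row contents, bounds, and selected element are made up for illustration):

```python
import bisect

row = [1, 3, 3, 5, 8]   # one sorted row of the matrix
lo, hi = 2, 6           # current bounds on the median

# The gap: elements strictly between lo and hi form a contiguous segment.
gap_begin = bisect.bisect_right(row, lo)  # first index with row[i] > lo
gap_end = bisect.bisect_left(row, hi)     # first index with row[i] >= hi
assert row[gap_begin:gap_end] == [3, 3, 5]

# The index range of a selected element within this row.
needle = 3
begin = bisect.bisect_left(row, needle)   # count of elements < needle
end = bisect.bisect_right(row, needle)    # count of elements <= needle
assert (begin, end) == (1, 3)
```

Summing `begin` and `end` over all rows yields the index range over the whole matrix, exactly as `matrix_index_range` in the code below does.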
By the same Chernoff-bound argument as above, there are O(log n) iterations with probability at least 1 - O(n^{-k}) for any constant k. Thus, with high probability, the entire algorithm takes O(n (log n)(log m)) time.
```python
import bisect
import random

def matrix_index_range(needle, haystack):
    """Computes the index range of needle over haystack, a matrix
    (stored as a list of rows) in which each row is sorted."""
    begin = end = 0
    for x in haystack:
        begin += bisect.bisect_left(x, needle)
        end += bisect.bisect_right(x, needle)
    return begin, end

def matrix_median(xs):
    if not xs or not xs[0]:
        return None
    n, m = len(xs), len(xs[0])
    lo, hi = xs[0][0], xs[0][m - 1]
    for x in xs:
        lo, hi = min(lo, x[0]), max(hi, x[m - 1])
    lo_begin, lo_end = matrix_index_range(lo, xs)
    hi_begin, hi_end = matrix_index_range(hi, xs)
    mid_idx = (n * m) // 2
    while True:
        if lo_begin <= mid_idx < lo_end:
            return lo
        if hi_begin <= mid_idx < hi_end:
            return hi
        assert hi_begin - lo_end > 0
        # Select a uniformly random element from the gap between lo and hi.
        mid = None
        midth = random.randint(0, hi_begin - lo_end - 1)
        for x in xs:
            gap_begin = bisect.bisect_right(x, lo)
            gap_end = bisect.bisect_left(x, hi)
            gap_size = gap_end - gap_begin
            if midth < gap_size:
                mid = x[gap_begin + midth]
                break
            midth -= gap_size
        assert mid is not None
        mid_begin, mid_end = matrix_index_range(mid, xs)
        assert lo_end <= mid_begin and mid_end <= hi_begin
        # Replace whichever bound keeps the median between lo and hi.
        if mid_end > mid_idx:
            hi, hi_begin, hi_end = mid, mid_begin, mid_end
        else:
            lo, lo_begin, lo_end = mid, mid_begin, mid_end
```
This solution is significantly faster than the first one when m is more than a constant.