NumPy: Calculates the cumulative median

Question

NumPy: Calculates the cumulative median

I have a sample with size = n.

I want to calculate for each i: 1 <= i <= n the median for sample[:i] in numpy. For example, I calculated the value for each i:

cummean = np.cumsum(sample) / np.arange(1, n + 1)

Is it possible to do something similar for a median without cycles and understanding?

+5

python vectorization numpy statistics

Artem kupriyanov Mar 13 '17 at 2:04

source share

4 answers

Use statistics.median and cumulative list comprehension (note that odd indices contain medians of even-length lists, where the median is the average of two median elements, so this usually results in decimal rather than integer):

 >>> from statistics import median >>> arr = [1, 3, 4, 2, 5, 3, 6] >>> cum_median = [median(arr[:i+1]) for i in range(len(arr)-1)] >>> cum_median [1, 2.0, 3, 2.5, 3, 3.0]

+2

Uriel Mar 13 '17 at 14:08

source share

Knowing that Python has a heapq module that allows you to do a “minimum” for iteration, I searched in heapq and median and found various elements for steaming medium . It:

http://www.ardendertat.com/2011/11/03/programming-interview-questions-13-median-of-integer-stream/

has a class streamMedian that supports two heapq , one with the bottom half of the values, the other with the top half. The median is either the “peak” of one or the average of both. The class has an insert method and a getMedian method. Most of the work is in insert .

I copied this to an Ipython session and determined:

 def cummedian_stream(b): S=streamMedian() ret = [] for item in b: S.insert(item) ret.append(S.getMedian()) return np.array(ret)

Testing:

 In [155]: a = np.random.randint(0,100,(5000)) In [156]: amed = cummedian_stream(a) In [157]: np.allclose(cummedian_sorted(a), amed) Out[157]: True In [158]: timeit cummedian_sorted(a) 1 loop, best of 3: 781 ms per loop In [159]: timeit cummedian_stream(a) 10 loops, best of 3: 39.6 ms per loop

The heapq stream heapq is faster.

Accounting for the list that @Uriel gave is relatively slow. But if I substitute np.median for statistics.median , this is faster than @Divakar's sorted solution:

 def fastloop(a): return np.array([np.median(a[:i+1]) for i in range(len(a))]) In [161]: timeit fastloop(a) 1 loop, best of 3: 360 ms per loop

And @Paul Panzer's section approach is also good, but still slow compared to the stream class.

 In [165]: timeit cummedian_partition(a) 1 loop, best of 3: 391 ms per loop

(I could copy the streamMedian class into this answer if necessary).

+2

hpaulj Mar 13 '17 at 23:15

source share

Is there a place for late recording?

 def cummedian_partition(a): n = len(a) assert n%4 == 0 # for simplicity mn = a.min() - 1 mx = a.max() + 1 h = n//2 N = n + h//2 work = np.empty((h, N), a.dtype) work[:, :n] = a work[:, n] = 2*mn - a[0] i, j = np.tril_indices(h, -1) work[i, n-1-j] = (2*mn - a[1:h+1])[j] k, l = np.ogrid[:h, :h//2 - 1] work[:, n+1:] = np.where(k > 2*l+1, mx, 2 * mn - mx) out = np.partition(work, (Nn//2-1, Nn//2, h//2-1, h//2), axis=-1) out = np.r_[2*mn-out[:, h//2: h//2-2:-1], out[::-1, Nn//2-1:Nn//2+1]] out[::2, 0] = out[::2, 1] return np.mean(out, axis=-1)

The algorithm uses a separation having linear complexity. Some gymnastics is required because np.partition does not support line split points. Combined complexity and required memory are quadratic.

Timing compared to the best at present:

 for j in (100, 1000, 5000): a = np.random.randint(0, 100, (j,)) print('size', j) print('correct', np.allclose(cummedian_partition(a), cummedian_sorted(a))) print('Divakar', timeit(lambda: cummedian_sorted(a), number=10)) print('PP', timeit(lambda: cummedian_partition(a), number=10)) # size 100 # correct True # Divakar 0.0022412699763663113 # PP 0.002393342030700296 # size 1000 # correct True # Divakar 0.20881508802995086 # PP 0.10222102201078087 # size 5000 # correct True # Divakar 6.158387024013791 # PP 3.437395485001616

+1

Paul panzer Mar 13 '17 at 20:09

source share

Divakar · Accepted Answer · 2017-03-13T17:23:20+0000

Here's an approach that replicates elements along rows to give us an array of 2D . Then we would fill the upper triangular region with a large number, so that later, when we sorted the array along each row, basically we would sort all the elements to diagonal elements and simulate the aggregate windows. Then, following the definition of median , choosing the average or average of two means (for an even number of elements), we get the elements in the first position: (0,0) , then for the second row: the average value (1,0) & (1,1) , for the third row: (2,1) , for the fourth row: the average of (3,1) & (3,2) and so on. So, we will extract these elements from the sorted array and, therefore, get our median values.

So the implementation will be -

 def cummedian_sorted(a): n = a.size maxn = a.max()+1 a_tiled_sorted = np.tile(a,n).reshape(-1,n) mask = np.triu(np.ones((n,n),dtype=bool),1) a_tiled_sorted[mask] = maxn a_tiled_sorted.sort(1) all_rows = a_tiled_sorted[np.arange(n), np.arange(n)//2].astype(float) idx = np.arange(1,n,2) even_rows = a_tiled_sorted[idx, np.arange(1,1+(n//2))] all_rows[idx] += even_rows all_rows[1::2] /= 2.0 return all_rows

Runtime test

Approaches -

 # Loopy solution from @Uriel soln def cummedian_loopy(arr): return [median(a[:i]) for i in range(1,len(a)+1)] # Nan-fill based solution from @Nickil Maveli soln def cummedian_nanfill(arr): a = np.tril(arr).astype(float) a[np.triu_indices(a.shape[0], k=1)] = np.nan return np.nanmedian(a, axis=1)

Dates -

Install # 1:

 In [43]: a = np.random.randint(0,100,(100)) In [44]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a)) ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a)) ...: True True In [45]: %timeit cummedian_loopy(a) ...: %timeit cummedian_nanfill(a) ...: %timeit cummedian_sorted(a) ...: 1000 loops, best of 3: 856 µs per loop 1000 loops, best of 3: 778 µs per loop 10000 loops, best of 3: 200 µs per loop

Install # 2:

 In [46]: a = np.random.randint(0,100,(1000)) In [47]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a)) ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a)) ...: True True In [48]: %timeit cummedian_loopy(a) ...: %timeit cummedian_nanfill(a) ...: %timeit cummedian_sorted(a) ...: 10 loops, best of 3: 118 ms per loop 10 loops, best of 3: 47.6 ms per loop 100 loops, best of 3: 18.8 ms per loop

Install # 3:

 In [49]: a = np.random.randint(0,100,(5000)) In [50]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a)) ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a)) True True In [54]: %timeit cummedian_loopy(a) ...: %timeit cummedian_nanfill(a) ...: %timeit cummedian_sorted(a) ...: 1 loops, best of 3: 3.36 s per loop 1 loops, best of 3: 583 ms per loop 1 loops, best of 3: 521 ms per loop

NumPy: Calculates the cumulative median

More articles: