Efficiently get the minimum value from a third array for each pair of elements from two arrays

I have two floating-point arrays of length N (which act as (x, y) coordinates and can contain duplicates) and an associated floating-point array z of length N (which acts as a weight for each coordinate pair).

For each unique floating-point (x, y) pair I need to select the one with the lowest associated z value. I defined a selectMinz() function that does this (see the code below), but it takes too long.

How can I improve the performance of this function?

    import numpy as np
    import time


    def getData():
        N = 100000
        x = np.arange(0.0005, 0.03, 0.001)
        y = np.arange(6., 10., .05)
        # Select N values for x,y, where values can be repeated
        x = np.random.choice(x, N)
        y = np.random.choice(y, N)
        z = np.random.uniform(10., 15., N)
        return x, y, z


    def selectMinz(x, y, z):
        """
        Select the minimum z for each (x,y) pair.
        """
        xy_unq, z_unq = [], []
        # For each (x,y) pair
        for i, xy in enumerate(zip(*[x, y])):
            # If this xy pair was already stored in the xy_unq list
            if xy in xy_unq:
                # If the stored z value associated with this xy pair is
                # larger than this new z[i] value
                if z_unq[xy_unq.index(xy)] > z[i]:
                    # Store this smaller value instead
                    z_unq[xy_unq.index(xy)] = z[i]
            else:
                # Store the xy pair, and its associated z value
                xy_unq.append(xy)
                z_unq.append(z[i])

        return xy_unq, z_unq


    # Define data with the proper format.
    x, y, z = getData()

    s = time.clock()
    xy_unq, z_unq = selectMinz(x, y, z)  # <-- TAKES TOO LONG (~15s in my system)
    print(time.clock() - s)
3 answers

Steps:

  • Use a lexicographic sort to bring identical xy pairs next to each other. Alternatively, we can use the scaling method: scale one of the arrays by the value range of the other, sum it with the other array, and finally use argsort to get the lexsort-equivalent indices.
  • Use np.minimum.reduceat to get the minimum values within the intervals defined by those pair groupings (a short sketch of both building blocks follows this list).
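
As a minimal illustration of these two building blocks (this toy snippet and its values are mine, not part of the original answer):

    import numpy as np

    # Toy data: three distinct (x, y) pairs, some of them repeated.
    x = np.array([1., 0., 1., 0., 1.])
    y = np.array([2., 3., 2., 3., 5.])
    z = np.array([9., 4., 7., 6., 1.])

    # np.lexsort uses its last key as the primary one, so this sorts by y,
    # then x -- enough to bring identical (x, y) pairs next to each other.
    sidx = np.lexsort([x, y])                     # array([0, 2, 1, 3, 4])

    # Positions in the sorted order where a new (x, y) group starts.
    cut_idx = np.array([0, 2, 4])

    # Minimum of z over each interval [cut_idx[i], cut_idx[i+1]).
    print(np.minimum.reduceat(z[sidx], cut_idx))  # [7. 4. 1.]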

Thus, we would have a vectorized solution, like so:

    def selectMinz_vectorized(x, y, z):
        # Get grouped lex-sort indices
        sidx = (y + x * (y.max() - y.min() + 1)).argsort()
        # or sidx = np.lexsort([x, y])

        # Lex-sort x, y, z
        x_sorted = x[sidx]
        y_sorted = y[sidx]
        z_sorted = z[sidx]

        # Get equality mask between each sorted X and Y elem against previous ones.
        # The non-zero indices of its inverted mask give us the indices where the
        # new groupings start. We are calling those cut_idx.
        seq_eq_mask = (x_sorted[1:] == x_sorted[:-1]) & (y_sorted[1:] == y_sorted[:-1])
        cut_idx = np.flatnonzero(np.concatenate(([True], ~seq_eq_mask)))

        # Use those cut_idx to get intervalled minimum values
        minZ = np.minimum.reduceat(z_sorted, cut_idx)

        # Make tuples of the groupings of x,y and the corresponding min Z values
        return (zip(x_sorted[cut_idx], y_sorted[cut_idx]), minZ.tolist())

Sample run -

    In [120]: np.c_[x,y,z]
    Out[120]:
    array([[  0.,   1.,  69.],
           [  2.,   0.,  47.],
           [  1.,   0.,  62.],
           [  0.,   2.,  33.],
           [  1.,   7.,  32.],
           [  1.,   0.,  50.],
           [  2.,   0.,  55.]])

    In [121]: selectMinz(x,y,z)  # original method
    Out[121]:
    ([(0.0, 1.0), (2.0, 0.0), (1.0, 0.0), (0.0, 2.0), (1.0, 7.0)],
     [69.0, 47.0, 50.0, 33.0, 32.0])

    In [122]: selectMinz_vectorized(x,y,z)
    Out[122]:
    ([(1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (0.0, 2.0), (1.0, 7.0)],
     [50.0, 47.0, 69.0, 33.0, 32.0])
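
Both methods return the same (x, y) pairs and minima, just in a different order. A quick way to confirm that (my own check, not part of the original answer) is to compare the outputs after putting them into a canonical order:

    xy1, z1 = selectMinz(x, y, z)             # original loop-based method
    xy2, z2 = selectMinz_vectorized(x, y, z)
    # zip() is a lazy iterator on Python 3, so materialize it before sorting.
    assert sorted(zip(xy1, z1)) == sorted(zip(list(xy2), z2))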

My initial approach involved stacking the columns into one 2D array and then performing those operations on it. The implementation looked something like this:

    def selectMinz_vectorized_v2(x, y, z):
        d = np.column_stack((x, y, z))
        sidx = np.lexsort(d[:, :2].T)
        b = d[sidx]
        cut_idx = np.r_[0, np.flatnonzero(~(b[1:, :2] == b[:-1, :2]).all(1)) + 1]
        minZ = np.minimum.reduceat(b[:, -1], cut_idx)
        return ([tuple(i) for i in b[cut_idx, :2]], minZ.tolist())

Benchmarking the vectorized approaches

Approaches -

    # Pruned version of the approach posted earlier
    def selectMinz_vectorized_pruned(x, y, z):
        sidx = (y + x * (y.max() - y.min() + 1)).argsort()
        x_sorted = x[sidx]
        y_sorted = y[sidx]
        z_sorted = z[sidx]
        seq_eq_mask = (x_sorted[1:] == x_sorted[:-1]) & (y_sorted[1:] == y_sorted[:-1])
        cut_idx = np.flatnonzero(np.concatenate(([True], ~seq_eq_mask)))
        minZ = np.minimum.reduceat(z_sorted, cut_idx)
        return x_sorted[cut_idx], y_sorted[cut_idx], minZ


    def numpy_indexed_app(x, y, z):  # @Eelco Hoogendoorn's soln
        return npi.group_by((x, y)).min(z)
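
Note that these benchmark functions assume numpy_indexed has been imported as npi and that getData accepts the sample size N (the question's version hard-codes N = 100000). A sketch of the setup assumed here:

    import numpy as np
    import numpy_indexed as npi  # needed by numpy_indexed_app


    def getData(N):
        # Same data generation as in the question, but with N as a parameter
        # so the benchmarks can vary the problem size.
        x = np.random.choice(np.arange(0.0005, 0.03, 0.001), N)
        y = np.random.choice(np.arange(6., 10., .05), N)
        z = np.random.uniform(10., 15., N)
        return x, y, z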

Timings -

    In [141]: x, y, z = getData(10000)

    In [142]: %timeit selectMinz_vectorized_pruned(x, y, z)
         ...: %timeit numpy_indexed_app(x, y, z)
    1000 loops, best of 3: 763 µs per loop
    1000 loops, best of 3: 1.09 ms per loop

    In [143]: x, y, z = getData(100000)

    In [144]: %timeit selectMinz_vectorized_pruned(x, y, z)
         ...: %timeit numpy_indexed_app(x, y, z)
    100 loops, best of 3: 8.53 ms per loop
    100 loops, best of 3: 12.9 ms per loop

Changing the data structures of xy_unq and z_unq to a single dictionary that stores both pieces of information reduced the time from ~7 s to ~0.1 s on my system.

    def selectMinz(x, y, z):
        """
        Select the minimum z for each (x,y) pair.
        """
        xy_unq = {}
        # For each (x,y) pair
        for i, xy in enumerate(zip(*[x, y])):
            # If this xy pair was already stored in the xy_unq dictionary
            if xy in xy_unq:
                # If the stored z value associated with this xy pair is
                # larger than this new z[i] value
                if xy_unq[xy] > z[i]:
                    # Store this smaller value instead
                    xy_unq[xy] = z[i]
            else:
                # Store the xy pair, and its associated z value
                xy_unq[xy] = z[i]

        return xy_unq.keys(), xy_unq.values()

The time for the above approach ranged from ~0.106 s to ~0.11 s for me. Here is an alternative with fewer lines of code, but it takes a little longer (~0.14 s):

    def selectMinz(x, y, z):
        """
        Select the minimum z for each (x,y) pair.
        """
        xy_unq = {}
        # For each (x,y) pair
        for i, xy in enumerate(zip(*[x, y])):
            # If this xy pair was already stored in the xy_unq dictionary
            if xy in xy_unq:
                # Keep the smaller of the stored value and the new z[i]
                xy_unq[xy] = min(xy_unq[xy], z[i])
            else:
                # Store the xy pair, and its associated z value
                xy_unq[xy] = z[i]

        return xy_unq.keys(), xy_unq.values()
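
On Python 3, dict.keys() and dict.values() return views rather than lists, so the caller may want to wrap them in list(). The lookup-and-update can also be collapsed into one line with dict.get; a sketch of that variant (mine, and not timed against the versions above):

    def selectMinz_get(x, y, z):
        """Select the minimum z for each (x, y) pair using dict.get."""
        xy_unq = {}
        for xy, zi in zip(zip(x, y), z):
            # Fall back to +inf for unseen pairs, so min() always keeps the
            # smallest z observed so far for this (x, y) pair.
            xy_unq[xy] = min(xy_unq.get(xy, float('inf')), zi)
        return list(xy_unq.keys()), list(xy_unq.values())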

The numpy_indexed package (disclaimer: I am the author) contains the functionality to solve these grouping problems in an elegant and efficient way:

    import numpy_indexed as npi
    xy_unq, z_unq = npi.group_by((x, y)).min(z)
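
For completeness, a minimal end-to-end usage sketch with the question's data (the call signature is taken from the line above; the surrounding code is mine):

    import numpy_indexed as npi

    x, y, z = getData()                         # data generator from the question
    xy_unq, z_unq = npi.group_by((x, y)).min(z)
    print(len(z_unq), 'unique (x, y) pairs')    # one minimum z per unique pair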

Source: https://habr.com/ru/post/1271240/

