@moarningsun - thanks for the comment and response. This made me on the right track, but it doesn’t work for me when the space is <radius from the data edge: in this case I use the window around the gap cell, which is “cropped” for the data borders. In this situation, the indices reflect a “full” window and therefore cannot be used to select cells from a restricted window.
Unfortunately, I edited this part of my code when I clarified the original question, but it turned out to be useful.
Now I realized that if you use argsort again on the output of argsort , then you will get ranks; that is, the position that each element will have when the shared array has been sorted. We can hide them, and then take the smallest number_required from them (and do this on a structured array to get the corresponding data at the same time).
This implies a different type in the loop, but in fact we can use markup rather than full sorting, because all we need is the smallest num_required elements. If num_required significantly less than the number of data elements, then this is much faster than executing argsort .
For example, with num_required = 80 and num_available = 15000, a full argsort takes ~ 900 μs, while an argpartition , followed by an index and a slice, takes ~ 110 μs to get the first 80. We still need to do argsort to get ranks from the very beginning (and not just distance-based partitioning) in order to get merge stability, and thus get “correct” when the distance is not unique.
My code, as shown below, now runs on ~ 610uS on real data, including actual calculations that are not shown here. I am happy with this now, but there seem to be a few other seemingly minor factors that can affect the runtime, which is hard to understand.
For example, putting circle_template in a structured array along with dist , ranks , and another field not shown here doubles the execution time of a common function (even if we don't get access to circle_template in a loop!). Even worse, using np.partition in a structured array with order=['ranks'] increases the total duration of a function by almost two orders of magnitude against using np.argpartition , as shown below!
# radius will in reality be ~100 radius = 2 y,x = np.ogrid[-radius:radius+1, -radius:radius+1] dist = np.sqrt(x**2 + y**2) circle_template = dist > radius ranks = dist.argsort(axis=None,kind='mergesort').argsort().reshape(dist.shape) diam = radius * 2 + 1