Indexing a NumPy array by row

Let's say I have a NumPy array:

    >>> X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
    >>> X
    array([[ 1,  2,  3,  4],
           [ 5,  6,  7,  8],
           [ 9, 10, 11, 12]])

and an array of indexes that I want to select for each row:

    >>> ixs = np.array([[1, 3], [0, 1], [1, 2]])
    >>> ixs
    array([[1, 3],
           [0, 1],
           [1, 2]])

How can I index X so that, for each row in X, I select the two indexes specified in the matching row of ixs?

So, for this case, I want to select elements 1 and 3 from the first row, elements 0 and 1 from the second row, and so on. The output should be:

    array([[ 2,  4],
           [ 5,  6],
           [10, 11]])

A slow solution would be something like this:

    output = np.array([row[ix] for row, ix in zip(X, ixs)])

However, this can become slow for extremely long arrays. Is there a faster way to do this without a loop using NumPy?

EDIT: Some very rough speed tests on a 2.5K * 1M array, with ixs 2K wide (10 GB):

    np.array([row[ix] for row, ix in zip(X, ixs)])
    0.16 s

    X[np.arange(len(ixs)), ixs.T].T
    0.175 s

    X.take(ixs+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None])
    33 s

    np.fromiter((X[i, j] for i, row in enumerate(ixs) for j in row), dtype=X.dtype).reshape(ixs.shape)
    2.4 s

+5
4 answers

You can use this:

 X[np.arange(len(ixs)), ixs.T].T 

See the NumPy documentation on advanced (fancy) indexing for details.
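To make the shapes in that one-liner easier to follow, here is a minimal sketch on the arrays from the question. The intermediate names (rows, cols, picked) are only illustrative and are not part of the original answer.

```python
import numpy as np

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
ixs = np.array([[1, 3], [0, 1], [1, 2]])

rows = np.arange(len(ixs))   # shape (3,): one row number per row of X
cols = ixs.T                 # shape (2, 3): the k-th pick of every row

# rows (3,) broadcasts against cols (2, 3), so picked[k, i] == X[i, ixs[i, k]]
picked = X[rows, cols]       # shape (2, 3)
result = picked.T            # shape (3, 2), same layout as ixs

# Cross-check against the list comprehension from the question
looped = np.array([row[ix] for row, ix in zip(X, ixs)])
print(result)
print(np.array_equal(result, looped))
```

The final transpose is only there to restore the (rows, picks) layout of ixs.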

+6

I believe you can use .take as follows:

    In [185]: X
    Out[185]:
    array([[ 1,  2,  3,  4],
           [ 5,  6,  7,  8],
           [ 9, 10, 11, 12]])

    In [186]: idx
    Out[186]:
    array([[1, 3],
           [0, 1],
           [1, 2]])

    In [187]: X.take(idx + (np.arange(X.shape[0]) * X.shape[1]).reshape(-1, 1))
    Out[187]:
    array([[ 2,  4],
           [ 5,  6],
           [10, 11]])

If the array dimensions are massive, it may be faster, though uglier, to build the index like this:

 idx+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None] 
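As a small worked example of what that expression computes, the sketch below spells out the flat offsets on the 3x4 array from the question. The names row_starts and flat are illustrative only; the idea is that .take without an axis indexes into the flattened array, so each column index needs the flat position of its row's first element added to it.

```python
import numpy as np

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
idx = np.array([[1, 3], [0, 1], [1, 2]])

n_rows, n_cols = X.shape

# Flat position of the first element of each row: [0, 4, 8]
row_starts = np.arange(0, n_rows * n_cols, n_cols)

# The (3, 1) column vector broadcasts across idx (3, 2):
# [[1, 3], [0, 1], [1, 2]] + [[0], [4], [8]] -> [[1, 3], [4, 5], [9, 10]]
flat = idx + row_starts[:, None]

# With no axis argument, take() indexes the flattened array
out = X.take(flat)
print(flat)
print(out)
```

Both ways of writing the offsets, np.arange(X.shape[0]) * X.shape[1] and np.arange(0, X.shape[0]*X.shape[1], X.shape[1]), produce the same values.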

Just for fun, see how the following goes:

 np.fromiter((X[i, j] for i, row in enumerate(ixs) for j in row), dtype=X.dtype, count=ixs.size).reshape(ixs.shape) 

Edit: adding timings

    In [15]: X = np.arange(1000*10000, dtype=np.int32).reshape(1000,-1)

    In [16]: ixs = np.random.randint(0, 10000, (1000, 2))

    In [17]: ixs.sort(axis=1)

    In [18]: ixs
    Out[18]:
    array([[2738, 3511],
           [3600, 7414],
           [7426, 9851],
           ...,
           [1654, 8252],
           [2194, 8200],
           [5497, 8900]])

    In [19]: %timeit np.array([row[ix] for row, ix in zip(X, ixs)])
    928 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    In [20]: %timeit X[np.arange(len(ixs)), ixs.T].T
    23.6 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    In [21]: %timeit X.take(ixs+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None])
    20.6 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    In [22]: %timeit np.fromiter((X[i, j] for i, row in enumerate(ixs) for j in row), dtype=X.dtype, count=ixs.size).reshape(ixs.shape)
    1.42 ms ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@mxbi I added some timings, and my results don't quite match yours; you may want to check this.

Here's a big array:

    In [33]: X = np.arange(10000*100000, dtype=np.int32).reshape(10000,-1)

    In [34]: ixs = np.random.randint(0, 100000, (10000, 2))

    In [35]: ixs.sort(axis=1)

    In [36]: X.shape
    Out[36]: (10000, 100000)

    In [37]: ixs.shape
    Out[37]: (10000, 2)

With some results:

    In [42]: %timeit np.array([row[ix] for row, ix in zip(X, ixs)])
    11.4 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [43]: %timeit X[np.arange(len(ixs)), ixs.T].T
    596 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    In [44]: %timeit X.take(ixs+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None])
    540 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now we use 500 column indices per row instead of 2, and the list comprehension starts to hold its own: it beats the transposed fancy-indexing version and is close to .take:

    In [45]: ixs = np.random.randint(0, 100000, (10000, 500))

    In [46]: ixs.sort(axis=1)

    In [47]: %timeit np.array([row[ix] for row, ix in zip(X, ixs)])
    93 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [48]: %timeit X[np.arange(len(ixs)), ixs.T].T
    133 ms ± 638 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [49]: %timeit X.take(ixs+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None])
    87.5 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
+3

The usual suggestion for indexing items from rows is:

 X[np.arange(X.shape[0])[:,None], ixs] 

That is, make an (n, 1) row index (a column vector) that broadcasts against the (n, m) shape of ixs to give an (n, m) result.

This is basically the same as:

 X[np.arange(len(ixs)), ixs.T].T 

which broadcasts an (n,) index against the (m, n) transpose of ixs and then transposes the result.
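To see the pairing that broadcasting produces, here is a hedged sketch on the question's arrays. np.broadcast_arrays is used purely for inspection. The last line uses np.take_along_axis (available since NumPy 1.15), which is not mentioned anywhere in this thread and is included only as a cross-check.

```python
import numpy as np

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
ixs = np.array([[1, 3], [0, 1], [1, 2]])

row_idx = np.arange(X.shape[0])[:, None]         # shape (3, 1)

# Inspect what broadcasting does to the two index arrays
r, c = np.broadcast_arrays(row_idx, ixs)         # both shape (3, 2)
print(r)   # each column pick is paired with its own row number

col_vector_form = X[row_idx, ixs]                        # (n, 1) with (n, m)
transposed_form = X[np.arange(len(ixs)), ixs.T].T        # (n,) with (m, n), then .T

# Assumption: an alternative not used in the thread
along_axis = np.take_along_axis(X, ixs, axis=1)

print(col_vector_form)
```

All three expressions should select the same elements; the column-vector form simply avoids the two transposes.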

The timings are essentially the same:

    In [299]: X = np.ones((1000,2000))

    In [300]: ixs = np.random.randint(0,2000,(1000,200))

    In [301]: timeit X[np.arange(len(ixs)), ixs.T].T
    6.58 ms ± 71.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [302]: timeit X[np.arange(X.shape[0])[:,None], ixs]
    6.57 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and for comparison:

    In [307]: timeit np.array([row[ix] for row, ix in zip(X, ixs)])
    6.63 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm a little surprised that the list comprehension does so well. I wonder how the relative advantages change as the dimensions scale, particularly with the relative shapes of X and ixs (long, wide, etc.).
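One way to explore that question is a small timing harness like the sketch below. The function names (by_loop, by_fancy, by_take, compare) and the shapes tried are hypothetical choices for illustration, and absolute numbers will depend heavily on the machine and NumPy version.

```python
import timeit
import numpy as np

def by_loop(X, ixs):
    return np.array([row[ix] for row, ix in zip(X, ixs)])

def by_fancy(X, ixs):
    return X[np.arange(X.shape[0])[:, None], ixs]

def by_take(X, ixs):
    # Flat offset of each row start, as a column vector
    offsets = np.arange(0, X.size, X.shape[1])[:, None]
    return X.take(ixs + offsets)

def compare(n_rows, n_cols, n_picks, repeats=5):
    """Return the best-of-`repeats` time in seconds for each approach."""
    rng = np.random.default_rng(0)
    X = rng.random((n_rows, n_cols))
    ixs = rng.integers(0, n_cols, size=(n_rows, n_picks))
    expected = by_loop(X, ixs)
    times = {}
    for func in (by_loop, by_fancy, by_take):
        # Make sure every approach agrees before timing it
        assert np.array_equal(func(X, ixs), expected)
        runs = timeit.repeat(lambda: func(X, ixs), number=1, repeat=repeats)
        times[func.__name__] = min(runs)
    return times

# Few picks per row versus many picks per row
for shape in [(1000, 2000, 2), (1000, 2000, 200)]:
    print(shape, compare(*shape))
```

Varying n_rows against n_picks should show where the per-row Python overhead of the list comprehension stops dominating.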


The first solution is the style of indexing produced by ix_:

    In [303]: np.ix_(np.arange(3), np.arange(2))
    Out[303]:
    (array([[0],
            [1],
            [2]]),
     array([[0, 1]]))
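As a rough illustration of that relationship, the sketch below compares plain ix_ indexing with the per-row selection on the question's arrays. The variable names are illustrative. The point is that ix_ yields a (3, 1) row index and a (1, 2) column index, which selects the same columns for every row; replacing the (1, 2) column index with the (3, 2) ixs array gives each row its own columns.

```python
import numpy as np

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
ixs = np.array([[1, 3], [0, 1], [1, 2]])

rows, cols = np.ix_(np.arange(3), np.arange(2))   # shapes (3, 1) and (1, 2)

# Same columns (0 and 1) for every row: equivalent to X[:, :2]
block = X[rows, cols]

# Keep the (3, 1) row index, but give every row its own columns
per_row = X[rows, ixs]

print(block)
print(per_row)
```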
+1

This should work

 [X[i][[y]] for i, y in enumerate(ixs)] 

EDIT: I just noticed that you wanted a solution without a loop.

0

Source: https://habr.com/ru/post/1275587/

