Vectorize numpy unique for subarrays

I have a numerical array data of shape (N, 20, 20), where N is some very large number. I want to get the number of unique values in each of the 20x20 subarrays. With a loop, that would be:

values = []
for i in data:
    values.append(len(np.unique(i)))

How can I vectorize this loop? Speed is a concern.

If I try np.unique(data), I get the unique values for the whole data array, not for the individual 20x20 blocks, so that is not what I need.

1 answer

First, you can work with data.reshape(N,-1), since you are interested in sorting the last two dimensions.

An easy way to get the number of unique values in each row is to dump the row into a set:

[len(set(i)) for i in data.reshape(data.shape[0],-1)]

But this is still an iteration, though probably a fairly fast one.

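A problem with "vectorizing" is that the set of unique values differs in length from row to row. "Rows of differing length" is a red flag when it comes to "vectorizing": you no longer have the "rectangular" data layout that makes most vectorization possible.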
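To make the ragged-result problem concrete, here is a minimal sketch with a tiny made-up array (three 2x2 blocks, not the asker's data). It only shows that np.unique returns arrays of different lengths per row, which is why the results cannot be stacked into one rectangular array.

```python
import numpy as np

# Made-up example: 3 blocks of shape (2, 2), flattened to rows of 4 values.
data = np.array([[[1, 1], [1, 1]],
                 [[1, 2], [2, 1]],
                 [[3, 4], [5, 6]]])
rows = data.reshape(data.shape[0], -1)

# Each row yields a unique-value array of a different length,
# so the list below cannot be turned into a regular 2-D array.
uniques = [np.unique(row) for row in rows]
lengths = [len(u) for u in uniques]
print(lengths)  # [1, 2, 4]
```

Only the lengths are needed here, which is what makes a count-based approach workable even though the unique values themselves are ragged.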

You could sort each row:

np.sort(data.reshape(N,-1))

array([[1, 2, 2, 3, 3, 5, 5, 5, 6, 6],
       [1, 1, 1, 2, 2, 2, 3, 3, 5, 7],
       [0, 0, 2, 3, 4, 4, 4, 5, 5, 9],
       [2, 2, 3, 3, 4, 4, 5, 7, 8, 9],
       [0, 2, 2, 2, 2, 5, 5, 5, 7, 9]])

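But how do you identify the unique values in each row without iterating? Counting the number of nonzero differences between neighbours might just do the trick: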

In [530]: data=np.random.randint(10,size=(5,10))

In [531]: [len(set(i)) for i in data.reshape(data.shape[0],-1)]
Out[531]: [7, 6, 6, 8, 6]

In [532]: sdata=np.sort(data,axis=1)
In [533]: (np.diff(sdata)>0).sum(axis=1)+1            
Out[533]: array([7, 6, 6, 8, 6])
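The sort-then-diff idea can be wrapped up for the original (N, 20, 20) layout roughly as follows. This is a sketch, not a tuned implementation: the helper name count_unique_per_block is made up for illustration, the random test data and seed are arbitrary, and every block is assumed to contain at least one element.

```python
import numpy as np

def count_unique_per_block(data):
    """Count distinct values in each block along axis 0 (hypothetical helper)."""
    # Flatten each block to one row; the 20x20 layout does not matter for counting.
    rows = data.reshape(data.shape[0], -1)
    srt = np.sort(rows, axis=1)
    # In a sorted row, a new distinct value starts wherever the difference
    # to the previous element is positive; add 1 for the first element.
    return (np.diff(srt, axis=1) > 0).sum(axis=1) + 1

rng = np.random.default_rng(0)  # arbitrary seed
data = rng.integers(0, 500, size=(1000, 20, 20))

fast = count_unique_per_block(data)
slow = np.array([len(np.unique(block)) for block in data])  # loop from the question
print(np.array_equal(fast, slow))
```

The comparison against the loop from the question is a cheap sanity check worth keeping around when swapping in the vectorized version.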

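If np.unique works for your data, this approach should work just as well.

Another option, for non-negative integer data, is np.bincount:

[(np.bincount(i)>0).sum() for i in data]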

This is an iterative solution that is clearly faster than my len(set(i)) version, and is competitive with diff...sort.
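For reference, a small self-contained sketch of the bincount approach on 3-D input. The helper name count_unique_bincount and the sample array are made up for illustration. It assumes non-negative integers of modest size, since np.bincount rejects negative values and allocates max(row) + 1 counters for each row.

```python
import numpy as np

def count_unique_bincount(data):
    """Per-block distinct counts via np.bincount (hypothetical helper).

    Assumes non-negative integer data.
    """
    rows = data.reshape(data.shape[0], -1)
    # bincount tallies how often each value 0..max(row) occurs;
    # the number of nonzero tallies is the number of distinct values.
    return np.array([(np.bincount(row) > 0).sum() for row in rows])

data = np.array([[[0, 0], [3, 3]],
                 [[1, 2], [3, 4]]])
counts = count_unique_bincount(data)
print(counts)  # [2 4]
```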

In [585]: data.shape
Out[585]: (10000, 400)

In [586]: timeit [(np.bincount(i)>0).sum() for i in data]
1 loops, best of 3: 248 ms per loop

In [587]: %%timeit                                       
sdata=np.sort(data,axis=1)
(np.diff(sdata)>0).sum(axis=1)+1
   .....: 
1 loops, best of 3: 280 ms per loop

I found a faster way of using bincount, with np.count_nonzero:

In [715]: timeit np.array([np.count_nonzero(np.bincount(i)) for i in data])
10 loops, best of 3: 59.6 ms per loop

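This speedup was unexpected. But count_nonzero is used inside other functions (for example, np.nonzero), so it makes sense that it is coded for speed. (It does not help in the diff...sort case, though.)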
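Timings like these depend on the machine, the NumPy version and the value range, so it is worth re-measuring on your own data. The following is a rough harness using the standard timeit module. The array is smaller than the one timed above and its value range (0 to 9) is an assumption, as are the labels in the candidates dict.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
# Assumed test data: 2000 rows of 400 small non-negative integers.
data = rng.integers(0, 10, size=(2000, 400))

candidates = {
    "len(set)": lambda d: np.array([len(set(i)) for i in d]),
    "sort+diff": lambda d: (np.diff(np.sort(d, axis=1)) > 0).sum(axis=1) + 1,
    "bincount>0": lambda d: np.array([(np.bincount(i) > 0).sum() for i in d]),
    "count_nonzero": lambda d: np.array([np.count_nonzero(np.bincount(i)) for i in d]),
}

# Check that all approaches agree before comparing their speed.
results = {name: fn(data) for name, fn in candidates.items()}
reference = results["len(set)"]
all_agree = all(np.array_equal(reference, r) for r in results.values())
print("all methods agree:", all_agree)

# Best of three single runs per approach, reported in milliseconds.
for name, fn in candidates.items():
    best = min(timeit.repeat(lambda: fn(data), number=1, repeat=3))
    print(f"{name:>14}: {best * 1e3:.1f} ms")
```

No expected timings are given here on purpose; the ranking may well differ from the numbers quoted in the answer.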


Source: https://habr.com/ru/post/1605767/

