Speed up a NumPy loop on 2D arrays: removing rows that are too similar

I am trying to significantly speed up the following code, but with no luck so far. The code takes a 2D array and deletes rows of the array that are too similar to other rows. See the code and comments below.

import numpy as np

as0 = a.shape[0]
for i in range(as0):
    a2s0 = a.shape[0] # shape may change after each iteration
    if i > (a2s0 - 1):
        break
    # takes the difference between all rows in array by iterating over each
    # row. Then sums the absolutes. The condition finally gives a boolean
    # array output - similarity condition of 0.01
    t = np.sum(np.absolute(a[i,:] - a), axis=1)<0.01
    # Retains the indices that are too similar and then deletes the
    # necessary row
    inddel = np.where(t)[0]
    inddel = [k for k in inddel if k != i]
    a = np.delete(a, inddel, 0)

I was wondering if vectorization is possible, but I'm not too familiar with it. Any help would be greatly appreciated.

Edit:

    if i >= (a2s0 - 1): # Added equality sign
        break
    # Now this only calculates over rows that have not been compared.
    t = np.sum(np.absolute(a[i,:] - a[np.arange(i+1,a2s0),:]), axis=1)>0.01
    t = np.concatenate((np.ones(i+1,dtype=bool), t))
    a = a[t, :]
3 answers

Approach #1: Broadcasting

Here is one vectorized approach based on broadcasting: expand `a` to 3D, then perform the calculations for all iterations at once -

mask = (np.absolute(a[:,None] - a)).sum(2) < 0.01
a_out = a[~np.triu(mask,1).any(0)]
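As a quick sanity check of the mask logic, here is a toy example I made up, where rows 1 and 3 are near-duplicates of rows 0 and 2:

```python
import numpy as np

a = np.array([[0.0, 0.0],
              [0.001, 0.002],   # near-duplicate of row 0
              [1.0, 1.0],
              [1.003, 1.001]])  # near-duplicate of row 2

# mask[i, j] is True when rows i and j differ by less than 0.01 in L1 norm.
mask = np.abs(a[:, None] - a).sum(2) < 0.01

# np.triu(mask, 1) keeps only pairs with i < j, so a row is dropped
# when it is similar to ANY earlier row.
a_out = a[~np.triu(mask, 1).any(0)]
```

Only the first member of each near-duplicate pair survives here, i.e. rows 0 and 2.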

Approach #2: pdist('cityblock')

For better memory efficiency, we can use pdist with the 'cityblock' (Manhattan) metric, then map the flat indices of the too-similar pairs back to row indices with searchsorted.

The implementation would look like this -

from scipy.spatial.distance import pdist

n = a.shape[0]
dists = pdist(a, 'cityblock')
idx = np.flatnonzero(dists < thresh)
sep_idx = np.arange(n-1,0,-1).cumsum()
rm_idx = np.unique(np.searchsorted(sep_idx,idx,'right'))
a_out = np.delete(a,rm_idx,axis=0)
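To see how the searchsorted step maps a condensed pdist index back to a row, here is a small worked example (the toy 4x2 array is mine). pdist stores the pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) in that flat order; `sep_idx` marks where each row's block of pairs ends, so counting how many boundaries lie at or before a flat index recovers the first row of the pair. Note this means the variant above deletes the earlier member of each close pair:

```python
import numpy as np
from scipy.spatial.distance import pdist

a = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [3.0, 0.0],
              [0.0, 0.005]])   # near-duplicate of row 0
n = a.shape[0]

dists = pdist(a, 'cityblock')          # 6 distances, pairs in row-major order
sep_idx = np.arange(n - 1, 0, -1).cumsum()   # block boundaries: [3, 5, 6]

idx = np.flatnonzero(dists < 0.01)     # flat index 2 is the pair (0, 3)
rm_idx = np.unique(np.searchsorted(sep_idx, idx, 'right'))  # -> row 0
a_out = np.delete(a, rm_idx, axis=0)
```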

Runtime test -

# Approach #2 from this post
from scipy.spatial.distance import pdist

def remove_similar_rows(a, thresh=0.01):
    n = a.shape[0]
    dists = pdist(a, 'cityblock')
    idx = np.flatnonzero(dists < thresh)
    sep_idx = np.arange(n-1,0,-1).cumsum()
    rm_idx = np.unique(np.searchsorted(sep_idx,idx,'right'))
    return np.delete(a,rm_idx,axis=0)

# @John Zwinck's soln
from sklearn.metrics.pairwise import manhattan_distances

def pairwise_manhattan_distances(a, thresh=0.01):
    d = manhattan_distances(a)
    return a[~np.any(np.tril(d < thresh), axis=0)]

Timings -

In [209]: a = np.random.randint(0,9,(4000,30))

# Set 100 rows randomly as dups
In [210]: idx0 = np.random.choice(4000,size=100, replace=0)

In [211]: idx1 = np.random.choice(4000,size=100, replace=0)

In [217]: a[idx0] = a[idx1]

In [238]: %timeit pairwise_manhattan_distances(a, thresh=0.01)
1 loops, best of 3: 225 ms per loop

In [239]: %timeit remove_similar_rows(a, thresh=0.01)
10 loops, best of 3: 100 ms per loop

Let's create some fake data:

np.random.seed(0)
a = np.random.random((4,3))

Which gives:

array([[ 0.5488135 ,  0.71518937,  0.60276338],
       [ 0.54488318,  0.4236548 ,  0.64589411],
       [ 0.43758721,  0.891773  ,  0.96366276],
       [ 0.38344152,  0.79172504,  0.52889492]])

Now compute the "distance" between every row and every other row, using the Manhattan distance:

import sklearn.metrics.pairwise
d = sklearn.metrics.pairwise.manhattan_distances(a)

Which gives:

array([[ 0.        ,  0.33859562,  0.64870931,  0.31577611],
       [ 0.33859562,  0.        ,  0.89318282,  0.6465111 ],
       [ 0.64870931,  0.89318282,  0.        ,  0.5889615 ],
       [ 0.31577611,  0.6465111 ,  0.5889615 ,  0.        ]])

Now apply a similarity threshold, keeping only the lower triangle so that each pair is counted once:

m = np.tril(d < 0.4, -1) # large threshold just for this example

Which gives:

array([[False, False, False, False],
       [ True, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]], dtype=bool)

This says that rows 1 and 3 are too similar to row 0. Now remove every row whose column contains a True:

a[~np.any(m, axis=0)] # axis can be either 0 or 1 - design choice

Which gives:

array([[ 0.54488318,  0.4236548 ,  0.64589411],
       [ 0.43758721,  0.891773  ,  0.96366276],
       [ 0.38344152,  0.79172504,  0.52889492]])

So the entire solution is:

d = sklearn.metrics.pairwise.manhattan_distances(a)
a = a[~np.any(np.tril(d < 0.4, -1), axis=0)]
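If you'd rather not depend on scikit-learn, the same pairwise Manhattan distance matrix can be built with plain NumPy broadcasting. This is a sketch: `manhattan_distances_np` is a name I made up, and it costs O(n²·m) memory, so it only makes sense for moderate n:

```python
import numpy as np

def manhattan_distances_np(a):
    # Pairwise L1 (Manhattan) distances via broadcasting.
    # Result is an (n, n) symmetric matrix with zeros on the diagonal.
    return np.abs(a[:, None, :] - a[None, :, :]).sum(axis=2)

np.random.seed(0)
a = np.random.random((4, 3))
d = manhattan_distances_np(a)

# Same filtering step as above: drop one row of each too-similar pair.
a_out = a[~np.any(np.tril(d < 0.4, -1), axis=0)]
```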

Consider this line from the original code:

t = np.sum(np.absolute(a[i,:] - a), axis=1)<0.01

Here row `i` is compared against every row of `a`, including rows that have already been compared on earlier iterations, so much of the work is repeated. Do you really need to compare each pair of rows twice?

Also note that the result depends on the order in which rows are processed: if row B is similar to both row A and row C, but A and C are not similar to each other, then which rows survive depends on which comparisons happen first. If that matters for your application, you should define more precisely which of a group of similar rows ought to be kept.
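The question's sequential loop and a fully vectorized mask can disagree on exactly such "chains" of similar rows. A small made-up example, where rows 0 and 1 are similar and rows 1 and 2 are similar, but rows 0 and 2 are not:

```python
import numpy as np

a = np.array([[0.000, 0.0],
              [0.006, 0.0],
              [0.012, 0.0]])

def remove_similar_loop(a, thresh=0.01):
    # Sequential version: a removed row can no longer knock out others.
    i = 0
    while i < a.shape[0]:
        keep = ~(np.abs(a[i] - a).sum(axis=1) < thresh)
        keep[i] = True  # never delete the current row itself
        a = a[keep]
        i += 1
    return a

# Vectorized version: every similar pair counts, even if one member
# would already have been removed by an earlier comparison.
mask = np.abs(a[:, None] - a).sum(2) < 0.01
vec_out = a[~np.triu(mask, 1).any(0)]

loop_out = remove_similar_loop(a)
```

The loop keeps rows 0 and 2 (row 1 is removed before it can knock out row 2), while the one-shot mask keeps only row 0.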


Source: https://habr.com/ru/post/1672619/

