Find the minimum cosine distance between two matrices

I have two 2D np.arrays, let's call them A and B, both with the same shape. For each vector (row) in the 2D array A, I need to find the vector in the matrix B that has the minimum cosine distance to it. To do this, I have a double loop, inside of which I look for the minimum value. So basically I do the following:

from scipy.spatial.distance import cosine
l, res = A.shape[0], []
for i in range(l):
    minimum = min((cosine(A[i], B[j]), j) for j in range(l))
    res.append(minimum[1])

In the code above, one of the loops is hidden inside a generator expression. Everything works fine, but the double for loop makes it too slow (I tried rewriting it as a double comprehension, which made things a little faster, but it is still slow).

I believe there is a numpy function that can do this faster (using some linear algebra).

So, is there a way to achieve what I want faster?

2 answers

From the cosine docs, we have the following information:

scipy.spatial.distance.cosine(u, v): computes the cosine distance between 1-D arrays.

The cosine distance between u and v is defined as

$$1 - \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$

where u⋅v is the dot product of u and v.
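
To make the definition concrete, here is a minimal sketch that evaluates the formula directly with NumPy on a few hand-checkable vectors. The helper name cosine_distance and the sample vectors are made up for illustration; the helper does not handle zero-length vectors.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance 1 - (u . v) / (||u||_2 * ||v||_2) for 1-D arrays.

    Hypothetical helper for illustration only; it does not guard
    against zero-length vectors.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

same_direction = cosine_distance([1, 0], [2, 0])  # parallel vectors
orthogonal = cosine_distance([1, 0], [0, 3])      # perpendicular vectors
at_45_degrees = cosine_distance([1, 0], [1, 1])   # 45 degrees apart
```

Parallel vectors give a distance of 0, perpendicular vectors give 1, and the 45 degree case gives 1 - 1/sqrt(2), so a smaller distance means the vectors point in more similar directions regardless of their lengths.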

Using the above formula, we can write a vectorized solution with NumPy's broadcasting, for example:

import numpy as np

# Get the dot products, L2 norms and thus cosine distances
dots = np.dot(A,B.T)
l2norms = np.sqrt(((A**2).sum(1)[:,None])*((B**2).sum(1)))
cosine_dists = 1 - (dots/l2norms)

# Get min values (if needed) and corresponding indices along the rows for res.
# Take care of zero L2 norm values, by using nanmin and nanargmin  
minval = np.nanmin(cosine_dists,axis=1)
cosine_dists[np.isnan(cosine_dists).all(1),0] = 0
res = np.nanargmin(cosine_dists,axis=1)
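
To see how the broadcasting in the l2norms line yields one value per (row of A, row of B) pair, here is a minimal sketch on tiny inputs that can be checked by hand. The arrays are made up for illustration, and they deliberately have different numbers of rows so the shapes are easy to follow. Since no row is all zeros here, plain argmin is used instead of the NaN-aware variants.

```python
import numpy as np

# Made-up inputs: 2 rows in A, 3 rows in B, 2 features each
A = np.array([[1.0, 0.0],
              [0.0, 1.0]])
B = np.array([[0.0, 2.0],
              [3.0, 0.0],
              [1.0, 1.0]])

dots = np.dot(A, B.T)             # shape (2, 3): all pairwise dot products
sq_a = (A ** 2).sum(1)[:, None]   # shape (2, 1): squared norms of A's rows
sq_b = (B ** 2).sum(1)            # shape (3,):   squared norms of B's rows
l2norms = np.sqrt(sq_a * sq_b)    # (2, 1) * (3,) broadcasts to (2, 3)

cosine_dists = 1 - dots / l2norms
res = np.argmin(cosine_dists, axis=1)  # closest row of B for each row of A
```

Here A[0] is parallel to B[1] and A[1] is parallel to B[0], so res comes out as [1, 0] with a minimum distance of 0 in both rows.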

Runtime Tests -

In [81]: def org_app(A,B):
    ...:    l, res, minval = A.shape[0], [], []
    ...:    for i in range(l):
    ...:        minimum = min((cosine(A[i], B[j]), j) for j in range(l))
    ...:        res.append(minimum[1])
    ...:        minval.append(minimum[0])
    ...:    return res, minval
    ...: 
    ...: def vectorized(A,B):
    ...:     dots = np.dot(A,B.T)
    ...:     l2norms = np.sqrt(((A**2).sum(1)[:,None])*((B**2).sum(1)))
    ...:     cosine_dists = 1 - (dots/l2norms)
    ...:     minval = np.nanmin(cosine_dists,axis=1)
    ...:     cosine_dists[np.isnan(cosine_dists).all(1),0] = 0
    ...:     res = np.nanargmin(cosine_dists,axis=1)
    ...:     return res, minval
    ...: 

In [82]: A = np.random.rand(400,500)
    ...: B = np.random.rand(400,500)
    ...: 

In [83]: %timeit org_app(A,B)
1 loops, best of 3: 10.8 s per loop

In [84]: %timeit vectorized(A,B)
10 loops, best of 3: 145 ms per loop

Confirm Results -

In [86]: x1, y1 = org_app(A, B)
    ...: x2, y2 = vectorized(A, B)
    ...: 

In [87]: np.allclose(np.asarray(x1),x2)
Out[87]: True

In [88]: np.allclose(np.asarray(y1)[~np.isnan(np.asarray(y1))],y2[~np.isnan(y2)])
Out[88]: True

Using scipy.spatial.distance.cdist:

import numpy as np
from scipy.spatial.distance import cdist

def cdist_func(A, B):
    dists = cdist(A, B, 'cosine')
    return np.argmin(dists, axis=1), np.min(dists, axis=1)
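
As a quick sanity check of what cdist returns, here is a small sketch on made-up inputs that can be verified by hand. The arrays and variable names are assumptions for illustration; cdist accepts inputs with different numbers of rows as long as the number of columns matches.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Made-up inputs: 2 rows in A, 3 rows in B, 2 features each
A = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])

dists = cdist(A, B, 'cosine')         # shape (2, 3): pairwise cosine distances
nearest = np.argmin(dists, axis=1)    # index of the closest row of B
nearest_dist = np.min(dists, axis=1)  # the corresponding minimum distance
```

Each row of A has a parallel row in B (A[0] with B[1], A[1] with B[0]), so nearest is [1, 0] and both minimum distances are 0 up to floating-point error.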

Checking that the results match:

x2, y2 = vectorized(A, B)
x3, y3 = cdist_func(A, B)

np.allclose(x2, x3) # True
np.allclose(y2, y3) # True

Timings:

%timeit vectorized(A, B) # 11.9 ms per loop
%timeit cdist_func(A, B) # 85.9 ms per loop

Source: https://habr.com/ru/post/1608268/