Distance between arrays numpy, columnwise

Question

Distance between arrays numpy, columnwise

I have 2 arrays in 2D where column vectors are feature vectors. One array has size F x A, the other F x B, where A <B. As an example, for A = 2 and F = 3 (B can be any):

arr1 = np.array( [[1, 4], [2, 5], [3, 6]] ) arr2 = np.array( [[1, 4, 7, 10, ..], [2, 5, 8, 11, ..], [3, 6, 9, 12, ..]] )

I want to calculate the distance between arr1 and an arr2 fragment that is of equal size (in this case 3x2), for each possible arr2 fragment. The column vectors are independent of each other, so I believe that I should calculate the distance between each column in arr1 and the set of column vectors from i to i + A from arr2 and take the sum of these distances (not sure though).

Does numpy offer an efficient way to do this, or do I need to take slices from the second array and using another loop, calculate the distance between each column in arr1 and the corresponding column vector in the slice?

An example for clarity using the arrays above:

 >>> magical_distance_func(arr1, arr2[:,:2]) [0, 10.3923..] >>> # First, distance between arr2[:,:2] and arr1, which equals 0. >>> # Second, distance between arr2[:,1:3] and arr1, which equals >>> diff = arr1 - np.array( [[4,7],[5,8],[6,9]] ) >>> diff [[-3, -3], [-3, -3], [-3, -3]] >>> # this happens to consist only of -3's. Norm of each column vector is: >>> norm1 = np.linalg.norm([:,0]) >>> norm2 = np.linalg.norm([:,1]) >>> # would be extremely good if this worked for an arbitrary number of norms >>> totaldist = norm1 + norm2 >>> totaldist 10.3923...

Of course, moving arrays is fine too, if that means cdist can be used here somehow.

+6

python arrays numpy

user1444165 Jun 19 '12 at 14:16

source share

3 answers

You can get the distance matrix using cdist from scipy.spatial.distance. When you have a distance matrix, you can simply sum the columns and normalize to get the average distance if that is what you are looking for.

Note. Instead of columns, cdist uses rows to calculate pairwise distances.

Here you have an example using the cosine distance:

 from scipy.spatial.distance import cdist arr1 = np.array( [[1, 7], [4, 8], [4, 0]] ) arr2 = array( [[1, 9, 3, 6, 2], [3, 9, 0, 2, 3], [6, 0, 2, 7, 4]] ) # distance matrix D = cdist( arr1.transpose(), arr2.transpose(), 'cosine' ) # average distance array (each position corresponds to each column of arr1) d1 = D.mean( axis=1 ) # average distance array (each position corresponds to each column of arr2) d2 = D.mean( axis=0 ) # Results d1 = array([ 0.23180963, 0.35643282]) d2 = array([ 0.31018485, 0.19337869, 0.46050302, 0.3233269 , 0.18321265])

There are many distances. Check out the documentation .

+2

Orientol nieto Jun 19 '12 at 15:16

source share

scipy.spatial.distance.cdist?

+1

reptilicus Jun 19 '12 at 14:44

source share

senderle · Accepted Answer · 2012-06-19T15:13:15+0000

If I understand your question correctly, this will work. Knowing numpy is probably the best way, but it is at least pretty simple. I used some contrived coordinates to show that the calculation works as expected.

 >>> arr1 array([[0, 3], [1, 4], [2, 5]]) >>> arr2 array([[ 3, 6, 5, 8], [ 5, 8, 13, 16], [ 2, 5, 2, 5]])

You can subtract arr1 from arr2 , making sure they are broadcast against each other correctly. The best way I could think of is to take a transpose and make some changes. They do not create copies - they create representations - so it is not so wasteful. ( dist is a copy, though.)

 >>> dist = (arr2.T.reshape((2, 2, 3)) - arr1.T).reshape((4, 3)) >>> dist array([[ 3, 4, 0], [ 3, 4, 0], [ 5, 12, 0], [ 5, 12, 0]])

Now all we need to do is apply numpy.linalg.norm to axis 1. (You can choose one of several norms ).

 >>> numpy.apply_along_axis(numpy.linalg.norm, 1, dist) array([ 5., 5., 13., 13.])

Assuming you want a simple Euclidean distance, you can also do this directly; not sure if this will be faster or slower, try both:

 >>> (dist ** 2).sum(axis=1) ** 0.5 array([ 5., 5., 13., 13.])

Based on your editing, we should only make one small tweak. Since you want to test the columns in half, and not block by block, you need a window for riding. This can be done very simply with fairly simple indexing:

 >>> arr2.T[numpy.array(zip(range(0, 3), range(1, 4)))] array([[[ 3, 5, 2], [ 6, 8, 5]], [[ 6, 8, 5], [ 5, 13, 2]], [[ 5, 13, 2], [ 8, 16, 5]]])

Combining this with other tricks:

 >>> arr2_pairs = arr2.T[numpy.array(zip(range(0, 3), range(1, 4)))] >>> dist = arr2_pairs - arr1.T >>> (dist ** 2).sum(axis=2) ** 0.5 array([[ 5. , 5. ], [ 9.69535971, 9.69535971], [ 13. , 13. ]])

However, converting arrays from list contexts tends to be slow. It may be faster to use stride_tricks - here, see which one is best for your purposes:

 >>> as_strided(arr2.T, strides=(8, 8, 32), shape=(3, 2, 3)) array([[[ 3, 5, 2], [ 6, 8, 5]], [[ 6, 8, 5], [ 5, 13, 2]], [[ 5, 13, 2], [ 8, 16, 5]]])

This actually manipulates how numpy moves around the block of memory, allowing small arrays to emulate a larger array.

 >>> arr2_pairs = as_strided(arr2.T, strides=(8, 8, 32), shape=(3, 2, 3)) >>> dist = arr2_pairs - arr1.T >>> (dist ** 2).sum(axis=2) ** 0.5 array([[ 5. , 5. ], [ 9.69535971, 9.69535971], [ 13. , 13. ]])

So now you have a simple 2nd array corresponding to the distance for each pair of columns. Now it's just a matter of getting mean and calling argmin .

 >>> normed = (dist ** 2).sum(axis=2) ** 0.5 >>> normed.mean(axis=1) array([ 5. , 9.69535971, 13. ]) >>> min_window = normed.mean(axis=1).argmin() >>> arr2[:,[min_window, min_window + 1]] array([[3, 6], [5, 8], [2, 5]])

Distance between arrays numpy, columnwise

More articles: