Calculate a weighted pairwise distance matrix in Python

I am trying to find the fastest way to do the following pairwise distance calculation in Python. I want to use distances to rank a list_of_objects according to their similarity.

Each element in list_of_objects is characterized by four dimensions a, b, c, d, which are measured on very different scales, for example:

    object_1 = [0.2, 4.5, 198, 0.003]
    object_2 = [0.3, 2.0, 999, 0.001]
    object_3 = [0.1, 9.2, 321, 0.023]
    list_of_objects = [object_1, object_2, object_3]

The goal is to get a pairwise distance matrix for the objects in list_of_objects. However, I want to be able to specify the "relative importance" of each dimension in the distance calculation through a vector of weights, with one weight per dimension, for example:

 weights = [1, 1, 1, 1] 

means that all measurements are equally weighted. In this case, I want each dimension to make an equal contribution to the distance between the objects, regardless of the scale of the measurement. As an alternative:

 weights = [1, 1, 1, 10] 

indicates that I want dimension d to contribute 10 times more than the other dimensions to the distance between the objects.

My current algorithm is as follows:

  • Calculate a pairwise distance matrix for each dimension
  • Normalize each distance matrix so that its maximum is 1
  • Multiply each distance matrix by the corresponding weight from weights
  • Sum the distance matrices to produce a single pairwise matrix
  • Use the summed matrix from the previous step to produce a ranked list of pairs of objects from list_of_objects

This works fine and gives me a weighted version of the city-block (Manhattan) distance between the objects.
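For concreteness, the steps above can be sketched roughly as follows. This is only an illustration of the procedure as described, not a tuned implementation; the names `total`, `weighted_matrix` and `ranked_pairs` are made up for the example, and it assumes no dimension is constant across all objects (otherwise the maximum distance is zero and the normalization divides by zero).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

list_of_objects = [
    [0.2, 4.5, 198, 0.003],
    [0.3, 2.0, 999, 0.001],
    [0.1, 9.2, 321, 0.023],
]
weights = [1, 1, 1, 1]

X = np.asarray(list_of_objects, dtype=float)
n = X.shape[0]

# Accumulate in pdist's condensed form: one entry per unordered pair (i < j).
total = np.zeros(n * (n - 1) // 2)

for dim, w in enumerate(weights):
    # Step 1: pairwise distances along this single dimension.
    d = pdist(X[:, [dim]], metric="cityblock")
    # Step 2: normalize so the largest distance is 1
    # (assumption: the dimension is not constant, so d.max() > 0).
    d /= d.max()
    # Steps 3 and 4: apply the weight and add to the running sum.
    total += w * d

# Square, symmetric form of the summed matrix.
weighted_matrix = squareform(total)

# Step 5: rank pairs from most similar (smallest distance) to least similar.
i_idx, j_idx = np.triu_indices(n, k=1)  # same pair order as pdist
order = np.argsort(total)
ranked_pairs = [(int(i_idx[m]), int(j_idx[m]), float(total[m])) for m in order]

print(weighted_matrix)
print(ranked_pairs)
```

Each tuple in `ranked_pairs` holds the two object indices and their weighted distance.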

I have two questions:

  • Without changing the algorithm, what is the fastest implementation in SciPy, NumPy or scikit-learn for calculating the initial per-dimension distances?

  • Is there an existing multi-dimensional distance approach that does all of this for me?

For question 2, I have looked but could not find anything with a built-in step that handles "relative importance" the way I want.

Other suggestions are welcome. Happy to clarify if I have left out any details.

+6
2 answers

scipy.spatial.distance is the module you want to look at. It provides many different norms (distance metrics) that can be applied easily.

I would recommend using the weighted Minkowski metric.

You can perform the pairwise distance calculation using the pdist function from this package.

For example:

    import numpy as np
    # note: wminkowski was removed in SciPy 1.8, so this import needs an older SciPy
    from scipy.spatial.distance import pdist, wminkowski, squareform

    object_1 = [0.2, 4.5, 198, 0.003]
    object_2 = [0.3, 2.0, 999, 0.001]
    object_3 = [0.1, 9.2, 321, 0.023]
    list_of_objects = [object_1, object_2, object_3]

    # make a 3x4 matrix (3 objects, 4 dimensions) from the list of objects
    X = np.array(list_of_objects)

    # calculate pairwise distances, using the weighted Minkowski norm with p=2
    distances = pdist(X, wminkowski, 2, [1, 1, 1, 10])

    # make a square matrix from the condensed result
    distances_as_2d_matrix = squareform(distances)

    print(distances)
    print(distances_as_2d_matrix)

This prints:

    [ 801.00390786  123.0899671   678.0382942 ]
    [[   0.          801.00390786  123.0899671 ]
     [ 801.00390786    0.          678.0382942 ]
     [ 123.0899671   678.0382942     0.        ]]
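On recent SciPy releases `wminkowski` is no longer available (as far as I know it was deprecated and then removed in 1.8), and the `minkowski` metric takes a `w` keyword instead. The two apply the weights differently: `wminkowski` summed |w_i (u_i - v_i)|^p, while `minkowski` with `w` sums w_i |u_i - v_i|^p. So, assuming you want to reproduce the numbers above, a sketch would raise the old weights to the power p (the name `old_weights` is just for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([
    [0.2, 4.5, 198, 0.003],
    [0.3, 2.0, 999, 0.001],
    [0.1, 9.2, 321, 0.023],
], dtype=float)

p = 2
old_weights = np.array([1.0, 1.0, 1.0, 10.0])  # weights as passed to wminkowski

# minkowski(w=...) multiplies |u_i - v_i|**p by w_i, so use old_weights**p
# to match wminkowski, which multiplied the differences before taking the power.
distances = pdist(X, "minkowski", p=p, w=old_weights ** p)
distances_as_2d_matrix = squareform(distances)

print(distances)
print(distances_as_2d_matrix)
```

Note that neither version rescales the dimensions, so here dimension c (values in the hundreds) still dominates the result even with a weight of 10 on d.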
+8

The normalization step, in which you divide the pairwise distances by their maximum value, seems non-standard and may make it hard to find a ready-made function that does exactly what you need. It is quite easy to do yourself, though. The starting point is turning your list_of_objects into an array:

    >>> import numpy as np
    >>> obj_arr = np.array(list_of_objects)
    >>> obj_arr.shape
    (3, 4)

Then you can get the pairwise distances using broadcasting. This is a little inefficient because it does not exploit the symmetry of your metric and calculates each distance twice:

    >>> dists = np.abs(obj_arr - obj_arr[:, None])
    >>> dists.shape
    (3, 3, 4)

Normalization is then a one-liner that divides each dimension's distances by that dimension's maximum:

    >>> dists /= dists.max(axis=(0, 1))

The final weighting can be done in several ways; you may want to time them to see which is faster:

    >>> dists.dot([1, 1, 1, 1])
    array([[ 0.        ,  1.93813131,  2.21542674],
           [ 1.93813131,  0.        ,  3.84644195],
           [ 2.21542674,  3.84644195,  0.        ]])
    >>> np.einsum('ijk,k->ij', dists, [1, 1, 1, 1])
    array([[ 0.        ,  1.93813131,  2.21542674],
           [ 1.93813131,  0.        ,  3.84644195],
           [ 2.21542674,  3.84644195,  0.        ]])
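One more option that avoids computing every distance twice: along a single dimension the largest pairwise distance is simply that dimension's range (max minus min), so normalizing, weighting and summing is the same as taking a plain city-block distance on rescaled columns. A minimal sketch of that idea, assuming non-negative weights and no constant dimension (a zero range would divide by zero); `col_range`, `scaled` and `condensed` are names made up for the example:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

obj_arr = np.array([
    [0.2, 4.5, 198, 0.003],
    [0.3, 2.0, 999, 0.001],
    [0.1, 9.2, 321, 0.023],
], dtype=float)
weights = np.array([1.0, 1.0, 1.0, 1.0])  # must be non-negative for this trick

# Range per dimension, which equals the maximum pairwise distance in that dimension.
col_range = np.ptp(obj_arr, axis=0)

# Rescale each column once; the city-block distance on the rescaled data is then
# sum_k weights[k] * |x_ik - x_jk| / col_range[k].
scaled = obj_arr * (weights / col_range)

# pdist computes each unordered pair only once (condensed form).
condensed = pdist(scaled, metric="cityblock")
weighted_dists = squareform(condensed)

print(weighted_dists)
```

With all weights equal to 1 this should give the same matrix as the dot / einsum results above, up to floating-point rounding.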
+3

Source: https://habr.com/ru/post/958445/

