Pdist for aan tensor

Question

Pdist for aan tensor

I have an anano matrix matrix

x = T.fmatrix('input')

x will later be filled with vectors n dim d (in train time).

I would like to have an anaeno analog of pdist ( scipy.spatial.distance.pdist pdist ), something like

 D = theano.pdist( x )

How can i achieve this?

Calling scipy.spatial.distance.pdist on x does not work directly, since x is only symbolic at this point ...

Update:. I would really like to imitate pdist "compact" behavior, i.e. calculate only ~ 1/2 entries n x n distance matrix.

+6

python scipy theano matlab

Shai Sep 17 '14 at 9:04

source share

2 answers

I haven’t worked with Theano before, but here is a solution based on pure Numpy functions (maybe you will convert it to equivalent anano functions. Note that I use automatic broadcasting in the expression below, so you can have to rewrite this explicitly if Theano does not support it):

 # X is an m-by-n matrix (rows are examples, columns are dimensions) # D is an m-by-m symmetric matrix of pairwise Euclidean distances a = np.sum(X**2, axis=1) D = np.sqrt((a + a[np.newaxis].T) - 2*np.dot(X, XT))

It is based on the fact that: ||uv||^2 = ||u||^2 + ||v||^2 - 2*uv . (I showed this in the previous answers of my use using MATLAB)

Here is a comparison with existing Scipy features:

 import numpy as np from scipy.spatial.distance import pdist, squareform def my_pdist(X): a = np.sum(X**2, axis=1) D = np.sqrt((a + a[np.newaxis].T) - 2*np.dot(X, XT)) return D def scipy_pdist(X): D = squareform(pdist(X, metric='euclidean')) return DX = np.random.rand(5, 3) D1 = my_pdist(X) D2 = scipy_pdist(X)

The difference should be slight, next to the epsilon car ( np.spacing(1) ):

 >>> np.linalg.norm(D1-D2) 8.5368137554718277e-16

NTN

EDIT:

Here is another implementation with one loop:

 def my_pdist_compact(X): D = np.empty(shape=[0,0], dtype=X.dtype) for i in range(X.shape[0]-1): D = np.append(D, np.sqrt(np.sum((X[i,] - X[i+1:,])**2, axis=1))) return D

Somewhat equivalent MATLAB code:

 function D = my_pdist_compact(X) n = size(X,1); D = cell(n-1,1); for i=1:n-1 D{i} = sqrt(sum(bsxfun(@minus, X(i,:), X(i+1:end,:)).^2, 2)); end D = vertcat(D{:}); end

This returns paired distances in a compact form (the upper triangular part of the symmetric matrix). This is the same result as pdist . Use squareform to convert it to a full matrix.

 >>> d1 = my_pdist_compact(X) >>> d2 = pdist(X) # from scipy.spatial.distance >>> (d1 == d2).all() True

I will leave it to you to see if it is possible to record an equivalent loop using Theano (see theano.scan )!

-2

Amro 21 sept '14 at 14:21

source share

eickenberg · Accepted Answer · 2014-09-26T17:53:15+0000

pdist 's pdist is a collection of different functions - there is no Anan equivalent for all of them at once. However, each specific distance, which is a mathematical expression of a closed form, can be written into Theano as such, and then compiled.

Take, for example, the Minkowski mink distance p (copy + paste):

 import theano import theano.tensor as T X = T.fmatrix('X') Y = T.fmatrix('Y') P = T.scalar('P') translation_vectors = X.reshape((X.shape[0], 1, -1)) - Y.reshape((1, Y.shape[0], -1)) minkowski_distances = (abs(translation_vectors) ** P).sum(2) ** (1. / P) f_minkowski = theano.function([X, Y, P], minkowski_distances)

Note that abs calls the built-in __abs__ , so abs also an anano function. Now we can compare this with pdist :

 import numpy as np from scipy.spatial.distance import pdist rng = np.random.RandomState(42) d = 20 # dimension nX = 10 nY = 30 x = rng.randn(nX, d).astype(np.float32) y = rng.randn(nY, d).astype(np.float32) ps = [1., 3., 2.] for p in ps: d_theano = f_minkowski(x, x, p)[np.triu_indices(nX, 1)] d_scipy = pdist(x, p=p, metric='minkowski') print "Testing p=%1.2f, discrepancy %1.3e" % (p, np.sqrt(((d_theano - d_scipy) ** 2).sum()))

This gives

 Testing p=1.00, discrepancy 1.322e-06 Testing p=3.00, discrepancy 4.277e-07 Testing p=2.00, discrepancy 4.789e-07

As you can see, there is a match, but the f_minkowski function f_minkowski somewhat more general, as it compares the strings of two, possibly different arrays. If the same array is passed twice as input, f_minkowski returns a matrix, while pdist returns a list without redundancy. If this behavior is desired, it can also be implemented completely dynamically, but I will adhere to the general case here.

One possibility of specialization should be noted: in the case p=2 calculations become simpler using a binomial formula, and this can be used to save precious memory space: if the total Minkowski distance, as implemented above, creates a 3D array (due to avoiding for- loops and summing cumulatively), which is prohibitive, depending on the size of d (and nX, nY ), for p=2 we can write

 squared_euclidean_distances = (X ** 2).sum(1).reshape((X.shape[0], 1)) + (Y ** 2).sum(1).reshape((1, Y.shape[0])) - 2 * X.dot(YT) f_euclidean = theano.function([X, Y], T.sqrt(squared_euclidean_distances))

which uses O(nX * nY) space instead of O(nX * nY * d) We check for compliance, this time on a common problem:

 d_eucl = f_euclidean(x, y) d_minkowski2 = f_minkowski(x, y, 2.) print "Comparing f_minkowski, p=2 and f_euclidean: l2-discrepancy %1.3e" % ((d_eucl - d_minkowski2) ** 2).sum()

getting

 Comparing f_minkowski, p=2 and f_euclidean: l2-discrepancy 1.464e-11

Pdist for aan tensor

EDIT:

More articles: