Speed up distance matrix computation with NumPy and Cython

Consider a multidimensional array A of dimension NxM. The goal is to compute the Euclidean distance matrix D, where each element D[i, j] is the Euclidean distance between rows i and j. What is the fastest way to do this? This is not exactly the problem I need to solve, but it's a good example of what I'm trying to do (in general, other distance metrics could be used).
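For concreteness, here is a naive reference implementation, just to pin down what D means (it is not meant to be fast, and the name dist_naive is only illustrative):

import numpy as np

def dist_naive(A):
    # D[i, j] = Euclidean distance between rows i and j of A
    n = A.shape[0]
    D = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.linalg.norm(A[i] - A[j])
    return D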

This is the fastest I could think of so far:

n = A.shape[0]
D = np.empty((n, n))
for i in range(n):
    D[i] = np.sqrt(np.square(A - A[i]).sum(1))

But is this the fastest way? I'm mostly concerned about the for loop. Can we beat this with, say, Cython?

To avoid the loop, I tried using broadcasting and doing something like this:

 D = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2)) 

But it turned out to be a bad idea, because there is some overhead in building an intermediate 3D NxNxM array, so performance is worse.
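To see how large that intermediate actually is, you can check its nbytes directly (the shape here is just an example; the intermediate holds N*N*M float64 values):

import numpy as np

A = np.random.randn(100, 200)                      # N = 100, M = 200
diff = A[np.newaxis, :, :] - A[:, np.newaxis, :]   # intermediate 3D array, shape (N, N, M)
print(diff.nbytes)                                 # 100 * 100 * 200 * 8 = 16,000,000 bytes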

I tried Cython, but I'm new to Cython, so I don't know how good my attempt is:

import numpy as np
cimport numpy as np

def dist(np.ndarray[np.int32_t, ndim=2] A):
    cdef int n = A.shape[0]
    cdef np.ndarray[np.float64_t, ndim=2] dm = np.empty((n, n), dtype=np.float64)
    cdef int i = 0
    for i in range(n):
        dm[i] = np.sqrt(np.square(A - A[i]).sum(1)).astype(np.float64)
    return dm

The code above was a bit slower than the plain Python for loop. I don't know much about Cython, but I assume I should at least be able to match the performance of the for loop + numpy version. I'm wondering whether it's possible to get a noticeable improvement when this is done correctly? Or is there another way to speed this up (without using parallel computing)?

1 answer

The key to getting performance out of Cython is to avoid Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.

There's a very useful tutorial here that covers the process of converting numpy code to Cython and optimizing it.

Here's a quick stab at a more optimized version of your distance function:

import numpy as np
cimport numpy as np
cimport cython

# don't use np.sqrt - the sqrt function from the C standard library is much
# faster
from libc.math cimport sqrt

# disable checks that ensure that array indices don't go out of bounds. this is
# faster, but you'll get a segfault if you mess up your indexing.
@cython.boundscheck(False)
# this disables 'wraparound' indexing from the end of the array using negative
# indices.
@cython.wraparound(False)
def dist(double [:, :] A):

    # declare C types for as many of our variables as possible. note that we
    # don't necessarily need to assign a value to them at declaration time.
    cdef:
        # Py_ssize_t is just a special platform-specific type for indices
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk

        # this line is particularly expensive, since creating a numpy array
        # involves unavoidable Python API overhead
        np.ndarray[np.float64_t, ndim=2] D = np.zeros((nrow, nrow), np.double)

        double tmpss, diff

    # another advantage of using Cython rather than broadcasting is that we can
    # exploit the symmetry of D by only looping over its upper triangle
    for ii in range(nrow):
        for jj in range(ii + 1, nrow):
            # we use tmpss to accumulate the SSD over each pair of rows
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss += diff * diff
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss  # because D is symmetric
    return D

I saved this in a file called fastdist.pyx. We can use pyximport to simplify the build process:

import pyximport
pyximport.install()
import fastdist
import numpy as np

A = np.random.randn(100, 200)

D1 = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
D2 = fastdist.dist(A)

print np.allclose(D1, D2)
# True

So it works, at least. Let's do some benchmarking using %timeit magic:

%timeit np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
# 100 loops, best of 3: 10.6 ms per loop

%timeit fastdist.dist(A)
# 100 loops, best of 3: 1.21 ms per loop

A ~9x speed-up is nice, but not really a game-changer. As you said, though, the big problem with the broadcasting approach is the memory requirement for constructing the intermediate array.

A2 = np.random.randn(1000, 2000)

%timeit fastdist.dist(A2)
# 1 loops, best of 3: 1.36 s per loop

I wouldn't recommend trying to do that with broadcasting...
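For a sense of why: the intermediate (N, N, M) float64 array alone would need about N * N * M * 8 bytes. A back-of-the-envelope check for the A2 example above (my arithmetic, not part of the original benchmark):

N, M = 1000, 2000
print(N * N * M * 8 / 1e9)   # ~16 GB for the intermediate array alone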

Another thing we could do is parallelize this along the outermost loop using the prange function:

from cython.parallel cimport prange

...

for ii in prange(nrow, nogil=True, schedule='guided'):
    ...
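Fleshing that out, a minimal sketch of what the full parallel function might look like (reconstructed from the changes described above, not the author's exact code; D is made a memoryview here so that it can be written to inside the nogil loop):

import numpy as np
cimport cython
from cython.parallel cimport prange
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def dist_parallel(double [:, :] A):
    cdef:
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk
        double [:, :] D = np.zeros((nrow, nrow), np.double)
        double tmpss, diff

    # each row's distances are independent, so the outer loop can be handed out
    # to OpenMP threads; 'guided' scheduling helps because later rows do less work
    for ii in prange(nrow, nogil=True, schedule='guided'):
        for jj in range(ii + 1, nrow):
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss = tmpss + diff * diff
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss
    return np.asarray(D)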

To compile the parallel version, you need to tell the compiler to enable OpenMP. I haven't figured out how to do this using pyximport, but if you're using gcc you can compile it manually like this:

$ cython fastdist.pyx
$ gcc -shared -pthread -fPIC -fwrapv -fopenmp -O3 \
    -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o fastdist.so fastdist.c
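Alternatively (this is my own suggestion, not something shown above), a small setup.py can pass the OpenMP flags through the normal Cython build machinery:

# setup.py - illustrative sketch; the gcc-style flags below are assumptions,
# adjust them for your compiler
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "fastdist",
    sources=["fastdist.pyx"],
    include_dirs=[np.get_include()],
    extra_compile_args=["-fopenmp", "-O3"],
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize([ext]))

Then build with python setup.py build_ext --inplace and import fastdist as before.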

With parallelism using 8 threads:

%timeit D2 = fastdist.dist_parallel(A2)
# 1 loops, best of 3: 509 ms per loop
