The key point in Cython is to avoid using Python objects and function calls as much as possible, including vectorized operations with numpy arrays. This usually means that you must write out all the loops manually and work with single elements of the array at a time.
There's a very useful tutorial here that covers the process of converting numpy code to Cython and optimizing it.
Here's a first pass at a more optimized version of your Cython function:
import numpy as np
cimport numpy as np
cimport cython
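Only the imports are reproduced above; the body of the function isn't shown here. A minimal sketch of what the rest of fastdist.pyx could look like, assuming dist computes the same pairwise Euclidean row distances as the broadcasting expression used for comparison below, is:

# sqrt from the C math library is much cheaper than np.sqrt on scalars
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def dist(double[:, :] A):
    # pairwise Euclidean distances between the rows of A,
    # written as explicit loops over single elements
    cdef int nrow = A.shape[0]
    cdef int ncol = A.shape[1]
    cdef double[:, :] D = np.empty((nrow, nrow), dtype=np.float64)
    cdef double tmpss, diff
    cdef int ii, jj, kk

    for ii in range(nrow):
        for jj in range(ii, nrow):
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss += diff * diff
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss    # D is symmetric
    return np.asarray(D)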
I saved this in a file called fastdist.pyx. We can use pyximport to simplify the build process:
import pyximport
pyximport.install()
import fastdist
import numpy as np

A = np.random.randn(100, 200)

D1 = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
D2 = fastdist.dist(A)

print np.allclose(D1, D2)
So it works, at least. Let's do some benchmarking using %timeit magic:
%timeit np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
100 loops, best of 3: 10.6 ms per loop

%timeit fastdist.dist(A)
100 loops, best of 3: 1.21 ms per loop
A ~9x speed-up is nice, but not really a game-changer. As you said, though, the big problem with the broadcasting approach is the memory required to construct the intermediate array.
A2 = np.random.randn(1000, 2000)
%timeit fastdist.dist(A2)
I wouldn't recommend trying to do that using broadcasting ...
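For that (1000, 2000) array, broadcasting would have to materialize an intermediate array of shape (1000, 1000, 2000); at 8 bytes per float64 element that's roughly 16 GB before the sum over the last axis is even taken.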
Another thing we could do is parallelize over the outermost loop, using the prange function:
from cython.parallel cimport prange

...

for ii in prange(nrow, nogil=True, schedule='guided'):
    ...
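The full parallel function isn't shown above; a sketch of what dist_parallel (used in the benchmark below) could look like, keeping the same loop structure as dist, is roughly:

# in fastdist.pyx, alongside dist() above
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel cimport prange
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def dist_parallel(double[:, :] A):
    cdef int nrow = A.shape[0]
    cdef int ncol = A.shape[1]
    cdef double[:, :] D = np.empty((nrow, nrow), dtype=np.float64)
    cdef double tmpss, diff
    cdef int ii, jj, kk

    # the outer loop is distributed across threads; everything inside it
    # only touches C-level variables, so the GIL can be released
    for ii in prange(nrow, nogil=True, schedule='guided'):
        for jj in range(ii, nrow):
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                # plain assignment rather than +=, so Cython treats tmpss as a
                # thread-private variable instead of inferring a reduction
                tmpss = tmpss + diff * diff
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss
    return np.asarray(D)

Inside a prange body, Cython turns variables that are updated with in-place operators into reduction variables, which is why tmpss is accumulated with a plain assignment here.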
To compile the parallel version, you need to tell the compiler to enable OpenMP. I haven't figured out how to do this using pyximport, but if you're using gcc you can compile it manually like this:
$ cython fastdist.pyx
$ gcc -shared -pthread -fPIC -fwrapv -fopenmp -O3 \
    -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o fastdist.so fastdist.c
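The OpenMP flags can also be passed through a small setup.py and built with python setup.py build_ext --inplace; a sketch along those lines (not what I used above) would look something like:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

# pass the OpenMP flags to both the compile and link steps
ext = Extension(
    "fastdist",
    sources=["fastdist.pyx"],
    include_dirs=[numpy.get_include()],
    extra_compile_args=["-fopenmp", "-O3"],
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize([ext]))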
With parallelism using 8 threads:
%timeit D2 = fastdist.dist_parallel(A2)
1 loops, best of 3: 509 ms per loop