Cython acceleration is not as great as expected

Question

Cython acceleration is not as great as expected

I wrote a Python function that calculates pairwise electromagnetic interactions between a large number (N ~ 10 ^ 3) of particles and stores the results in the NxN128 ndarray complex. It works, but it is the slowest part of a larger program, taking about 40 seconds when N = 900 [fixed]. The source code is as follows:

import numpy as np def interaction(s,alpha,kprop): # s is an Nx3 real array # alpha is complex # kprop is float ndipoles = s.shape[0] Amat = np.zeros((ndipoles,3, ndipoles, 3), dtype=np.complex128) I = np.array([[1,0,0],[0,1,0],[0,0,1]]) im = complex(0,1) k2 = kprop*kprop for i in range(ndipoles): xi = s[i,:] for j in range(ndipoles): if i != j: xj = s[j,:] dx = xi-xj R = np.sqrt(dx.dot(dx)) n = dx/R kR = kprop*R kR2 = kR*kR A = ((1./kR2) - im/kR) nxn = np.outer(n, n) nxn = (3*A-1)*nxn + (1-A)*I nxn *= -alpha*(k2*np.exp(im*kR))/R else: nxn = I Amat[i,:,j,:] = nxn return(Amat.reshape((3*ndipoles,3*ndipoles)))

I had never used Cython before, but it seemed like a good place for my acceleration efforts, so I almost completely blindly adapted the methods that I found in online tutorials. I got acceleration (30 seconds versus 40 seconds), but not as dramatic as I expected, so I'm wondering if I'm doing something wrong or not seeing a critical step. The following is my best attempt at cytology of the above procedure:

 import numpy as np cimport numpy as np DTYPE = np.complex128 ctypedef np.complex128_t DTYPE_t def interaction(np.ndarray s, DTYPE_t alpha, float kprop): cdef float k2 = kprop*kprop cdef int i,j cdef np.ndarray xi, xj, dx, n, nxn cdef float R, kR, kR2 cdef DTYPE_t A cdef int ndipoles = s.shape[0] cdef np.ndarray Amat = np.zeros((ndipoles,3, ndipoles, 3), dtype=DTYPE) cdef np.ndarray I = np.array([[1,0,0],[0,1,0],[0,0,1]]) cdef DTYPE_t im = complex(0,1) for i in range(ndipoles): xi = s[i,:] for j in range(ndipoles): if i != j: xj = s[j,:] dx = xi-xj R = np.sqrt(dx.dot(dx)) n = dx/R kR = kprop*R kR2 = kR*kR A = ((1./kR2) - im/kR) nxn = np.outer(n, n) nxn = (3*A-1)*nxn + (1-A)*I nxn *= -alpha*(k2*np.exp(im*kR))/R else: nxn = I Amat[i,:,j,:] = nxn return(Amat.reshape((3*ndipoles,3*ndipoles)))

+5

python numpy cython

Grant petty Oct 11 '15 at 17:19

source share

1 answer

Divakar · Accepted Answer · 2015-10-11T18:52:59+0000

The real power of NumPy is to perform an operation through a huge number of elements in a vectorized form instead of using this operation in pieces arranged in loops. In your case, you use two nested loops and one conditional IF statement. I would suggest expanding the sizes of intermediate arrays that would lead to NumPy powerful broadcasting capability to enter the game, and thus the same operations could be used for all elements at a time instead of small pieces of data in a loop.

None / np.newaxis can be used to expand sizes. Thus, a vector implementation to follow this premise would look like this:

 def vectorized_interaction(s,alpha,kprop): im = complex(0,1) I = np.array([[1,0,0],[0,1,0],[0,0,1]]) k2 = kprop*kprop # Vectorized calculations for dx, R, n, kR, A sd = s[:,None] - s Rv = np.sqrt((sd**2).sum(2)) nv = sd/Rv[:,:,None] kRv = Rv*kprop Av = (1./(kRv*kRv)) - im/kRv # Vectorized calculation for: "nxn = np.outer(n, n)" nxnv = nv[:,:,:,None]*nv[:,:,None,:] # Vectorized calculation for: "(3*A-1)*nxn + (1-A)*I" P = (3*Av[:,:,None,None]-1)*nxnv + (1-Av[:,:,None,None])*I # Vectorized calculation for: "-alpha*(k2*np.exp(im*kR))/R" multv = -alpha*(k2*np.exp(im*kRv))/Rv # Vectorized calculation for: "nxn *= -alpha*(k2*np.exp(im*kR))/R" outv = P*multv[:,:,None,None] # Simulate ELSE part of the conditional statement"if i != j:" # with masked setting to I on the last two dimensions outv[np.eye((N),dtype=bool)] = I return outv.transpose(0,2,1,3).reshape(N*3,-1)

Run-time checks and output checks -

Case No. 1:

 In [703]: N = 10 ...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3) ...: alpha = 3j ...: kprop = 5.4 ...: In [704]: out_org = interaction(s,alpha,kprop) ...: out_vect = vectorized_interaction(s,alpha,kprop) ...: print np.allclose(np.real(out_org),np.real(out_vect)) ...: print np.allclose(np.imag(out_org),np.imag(out_vect)) ...: True True In [705]: %timeit interaction(s,alpha,kprop) 100 loops, best of 3: 7.6 ms per loop In [706]: %timeit vectorized_interaction(s,alpha,kprop) 1000 loops, best of 3: 304 µs per loop

Case No. 2:

 In [707]: N = 100 ...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3) ...: alpha = 3j ...: kprop = 5.4 ...: In [708]: out_org = interaction(s,alpha,kprop) ...: out_vect = vectorized_interaction(s,alpha,kprop) ...: print np.allclose(np.real(out_org),np.real(out_vect)) ...: print np.allclose(np.imag(out_org),np.imag(out_vect)) ...: True True In [709]: %timeit interaction(s,alpha,kprop) 1 loops, best of 3: 826 ms per loop In [710]: %timeit vectorized_interaction(s,alpha,kprop) 100 loops, best of 3: 14 ms per loop

Case No. 3:

 In [711]: N = 900 ...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3) ...: alpha = 3j ...: kprop = 5.4 ...: In [712]: out_org = interaction(s,alpha,kprop) ...: out_vect = vectorized_interaction(s,alpha,kprop) ...: print np.allclose(np.real(out_org),np.real(out_vect)) ...: print np.allclose(np.imag(out_org),np.imag(out_vect)) ...: True True In [713]: %timeit interaction(s,alpha,kprop) 1 loops, best of 3: 1min 7s per loop In [714]: %timeit vectorized_interaction(s,alpha,kprop) 1 loops, best of 3: 1.59 s per loop

Cython acceleration is not as great as expected

More articles: