In the simple example that you indicated with for k in xrange(4): body of the loop is executed only twice (if r==s ) or three times (if r!=s ) and the initial numpy implementation is lower, the slower the large coefficient. Numpy is optimized for performing calculations over long vectors, and if the vectors are short, the overhead can outweigh the benefits. (And pay attention to this formula, matrices are sliced ββin different dimensions and not indexed in contact, which can only complicate the task for vectorization).
import numpy as np distance_matrix_np = np.array(distance_matrix) stream_matrix_np = np.array(stream_matrix) n = 4 def deltaC_np(r, s, sol): delta = 0 sol_r, sol_s = sol[r], sol[s] K = np.array([i for i in xrange(n) if i!=r and i!=s]) return np.sum( (stream_matrix_np[r,K] - stream_matrix_np[s,K]) \ * (distance_matrix_np[sol_s,sol[K]] - distance_matrix_np[sol_r,sol[K]]) + \ (stream_matrix_np[K,r] - stream_matrix_np[K,s]) \ * (distance_matrix_np[sol[K],sol_s] - distance_matrix_np[sol[K],sol_r]))
In this implementation of numpy, and not in a for loop on elements in K , operations apply to all elements in K within numpy. Also note that your mathematical expression may be simplified. Each term in brackets on the left is a negative term in brackets on the right. 
This also applies to your source code. For example, (self._data.distance_matrix[sol[s]][sol[k]] - self._data.distance_matrix[sol[r]][sol[k]]) is -1 times (self._data.distance_matrix[sol[r]][sol[k]] - self._data.distance_matrix[sol[s]][sol[k]]) , so you did unnecessary calculations and your source code can be optimized without using numpy.
Turns out the numpy function's bottleneck is understanding an innocent list
K = np.array([i for i in xrange(n) if i!=r and i!=s])
As soon as it is replaced with a vectorizing code
if r==s: K=np.arange(n-1) K[r:] += 1 else: K=np.arange(n-2) if r<s: K[r:] += 1 K[s-1:] += 1 else: K[s:] += 1 K[r-1:] += 1
the numpy function is much faster.
The runtime graph is shown below (at the bottom bottom of this answer is the original graph before optimizing the numpy function). You can see that it makes sense to use your optimized source code or numpy code depending on how big the matrix is.

The full code is provided below for reference, partly in case someone else can continue it. (The deltaC2 function is your source code, optimized to take into account how to simplify the mathematical expression.)
def deltaC(r, s, sol): delta = 0 sol_r, sol_s = sol[r], sol[s] for k in xrange(n): if k != r and k != s: delta += \ stream_matrix[r][k] \ * (distance_matrix[sol_s][sol[k]] - distance_matrix[sol_r][sol[k]]) + \ stream_matrix[s][k] \ * (distance_matrix[sol_r][sol[k]] - distance_matrix[sol_s][sol[k]]) + \ stream_matrix[k][r] \ * (distance_matrix[sol[k]][sol_s] - distance_matrix[sol[k]][sol_r]) + \ stream_matrix[k][s] \ * (distance_matrix[sol[k]][sol_r] - distance_matrix[sol[k]][sol_s]) return delta import numpy as np def deltaC_np(r, s, sol): delta = 0 sol_r, sol_s = sol[r], sol[s] if r==s: K=np.arange(n-1) K[r:] += 1 else: K=np.arange(n-2) if r<s: K[r:] += 1 K[s-1:] += 1 else: K[s:] += 1 K[r-1:] += 1 #K = np.array([i for i in xrange(n) if i!=r and i!=s]) #TOO SLOW return np.sum( (stream_matrix_np[r,K] - stream_matrix_np[s,K]) \ * (distance_matrix_np[sol_s,sol[K]] - distance_matrix_np[sol_r,sol[K]]) + \ (stream_matrix_np[K,r] - stream_matrix_np[K,s]) \ * (distance_matrix_np[sol[K],sol_s] - distance_matrix_np[sol[K],sol_r])) def deltaC2(r, s, sol): delta = 0 sol_r, sol_s = sol[r], sol[s] for k in xrange(n): if k != r and k != s: sol_k = sol[k] delta += \ (stream_matrix[r][k] - stream_matrix[s][k]) \ * (distance_matrix[sol_s][sol_k] - distance_matrix[sol_r][sol_k]) \ + \ (stream_matrix[k][r] - stream_matrix[k][s]) \ * (distance_matrix[sol_k][sol_s] - distance_matrix[sol_k][sol_r]) return delta import time N=200 elapsed1s = [] elapsed2s = [] elapsed3s = [] ns = range(10,410,10) for n in ns: distance_matrix_np=np.random.uniform(0,n**2,size=(n,n)) stream_matrix_np=np.random.uniform(0,n**2,size=(n,n)) distance_matrix=distance_matrix_np.tolist() stream_matrix=stream_matrix_np.tolist() sol = range(n-1,-1,-1) sol_np = np.array(range(n-1,-1,-1)) Is = np.random.randint(0,n-1,4) Js = np.random.randint(0,n-1,4) total1 = 0 start = time.clock() for reps in xrange(N): for i in Is: for j in Js: total1 += deltaC(i,j, sol) elapsed1 = (time.clock() - start) start = time.clock() total2 = 0 start = time.clock() for reps in xrange(N): for i in Is: for j in Js: total2 += deltaC_np(i,j, sol_np) elapsed2 = (time.clock() - start) total3 = 0 start = time.clock() for reps in xrange(N): for i in Is: for j in Js: total3 += deltaC2(i,j, sol_np) elapsed3 = (time.clock() - start) print n, elapsed1, elapsed2, elapsed3, total1, total2, total3 elapsed1s.append(elapsed1) elapsed2s.append(elapsed2) elapsed3s.append(elapsed3) #Check errors of one method against another #err = 0 #for i in range(min(n,50)): # for j in range(min(n,50)): # err += np.abs(deltaC(i,j,sol)-deltaC_np(i,j,sol_np)) #print err import matplotlib.pyplot as plt plt.plot(ns, elapsed1s, label='Original',lw=2) plt.plot(ns, elapsed3s, label='Optimized',lw=2) plt.plot(ns, elapsed2s, label='numpy',lw=2) plt.legend(loc='upper left', prop={'size':16}) plt.xlabel('matrix size') plt.ylabel('time') plt.show()
And here is the original chart before optimizing list comprehension in deltaC_np
