Performance difference between numpy and matlab

I am implementing the backpropagation algorithm for a sparse autoencoder. I wrote it in Python using NumPy and in Matlab. The code is almost the same, but the performance is very different. Matlab completes the task in 0.252454 seconds and NumPy in 0.973672151566, which is almost four times as long. I will call this code several times later in a minimization task, so this difference leads to several minutes of delay between the implementations. Is this normal behavior? How can I improve the performance in NumPy?

Numpy implementation:

sparse.rho is a sparsity parameter, sparse.nodes is the number of nodes in the hidden layer (25), sparse.input (64) is the number of nodes in the input layer, theta1 and theta2 are the weight matrices for the first and second layers with sizes 25x64 and 64x25 respectively, m is 10000, rhoest has shape (25,), x is 10000x64, a3 is 10000x64 and a2 is 10000x25.

UPDATE: I changed the code following some of the ideas in the answers. The performance is now NumPy: 0.65 seconds versus Matlab: 0.25 seconds.

    partial_j1 = np.zeros(sparse.theta1.shape)
    partial_j2 = np.zeros(sparse.theta2.shape)
    partial_b1 = np.zeros(sparse.b1.shape)
    partial_b2 = np.zeros(sparse.b2.shape)
    t = time.time()
    # sum2 does not depend on i; per the answers below it is computed once before this loop
    delta3t = (-(x - a3) * a3 * (1 - a3)).T
    for i in range(m):
        delta3 = delta3t[:, i:(i + 1)]
        sum1 = np.dot(sparse.theta2.T, delta3)
        delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
        partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
        partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
        partial_b1 += delta2
        partial_b2 += delta3
    print "Backprop time:", time.time() - t

Matlab implementation:

    tic
    for i = 1:m
        delta3 = -(data(i,:) - a3(i,:)) .* a3(i,:) .* (1 - a3(i,:));
        delta3 = delta3.';
        sum1 = W2.'*delta3;
        sum2 = beta*(-sparsityParam./rhoest + (1 - sparsityParam) ./ (1.0 - rhoest));
        delta2 = (sum1 + sum2) .* a2(i,:).' .* (1 - a2(i,:).');
        W1grad = W1grad + delta2*a1(i,:);
        W2grad = W2grad + delta3*a2(i,:);
        b1grad = b1grad + delta2;
        b2grad = b2grad + delta3;
    end
    toc
+21
performance python numpy matlab backpropagation
Aug 29 '13 at 16:43
3 answers

It would be wrong to say "Matlab is always faster than NumPy" or vice versa. Often their performance is comparable. When you use NumPy, to get good performance you have to keep in mind that NumPy's speed comes from calling underlying functions written in C/C++/Fortran. It performs well when you apply those functions to whole arrays. In general, you get poorer performance when you call those NumPy functions on small arrays or scalars inside a Python loop.

What is wrong with a Python loop, you ask? Every iteration through a Python loop is a call to a next method. Every use of [] indexing is a call to a __getitem__ method. Every += is a call to __iadd__. Every dotted attribute lookup (such as np.dot) involves function calls. These function calls add up to a significant hindrance to speed. These hooks give Python expressive power: indexing for strings means something different than indexing for dicts, for example. Same syntax, different meanings. The magic is accomplished by giving objects different __getitem__ methods.
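To make this concrete, here is a toy class (my own illustration, not from the question) that shows Python dispatching a method call for every use of []:

    class Probe(object):
        """Toy object that reveals the method call behind []-indexing."""
        def __getitem__(self, key):
            print("__getitem__ called with %r" % (key,))
            return 0

    p = Probe()
    p[3]      # __getitem__ called with 3
    p[1:4]    # __getitem__ called with slice(1, 4, None)

Inside your loop, every expression like a2[i:(i+1), :] pays this dispatch cost on each of the m = 10000 iterations.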

But this expressive power comes at a cost. So when you do not need all that dynamic expressiveness and you want better performance, try to limit yourself to NumPy function calls on whole arrays.

So remove the for-loop and use vectorized equations whenever possible. For example, instead of

    for i in range(m):
        delta3 = -(x[i, :] - a3[i, :]) * a3[i, :] * (1 - a3[i, :])

you can calculate delta3 for each i at the same time:

 delta3 = -(x-a3)*a3*(1-a3) 

Whereas delta3 in the for-loop is a vector, with the vectorized equation delta3 is a matrix.
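A small sketch (shapes chosen to match the question: m = 10000, 64 input nodes; the data is random stand-ins) makes the difference visible:

    import numpy as np

    m = 10000
    x = np.random.random((m, 64))
    a3 = np.random.random((m, 64))

    # one loop iteration produces a vector ...
    delta3_row = -(x[0, :] - a3[0, :]) * a3[0, :] * (1 - a3[0, :])
    print(delta3_row.shape)    # (64,)

    # ... the vectorized equation produces the whole matrix at once
    delta3_all = -(x - a3) * a3 * (1 - a3)
    print(delta3_all.shape)    # (10000, 64)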




Some of the calculations in the for-loop do not depend on i and therefore should be hoisted outside the loop. For example, sum2 looks like a constant:

 sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest) ) 
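A minimal sketch of the hoist, with made-up stand-ins for the question's beta, rho and rhoest (25 hidden nodes assumed; the values are for illustration only):

    import numpy as np

    beta, rho = 3.0, 0.01              # assumed values, for illustration only
    rhoest = np.random.random(25)      # shape (25,), as in the question

    # computed once, outside the loop: it does not depend on i
    sum2 = beta * (-rho / rhoest + (1.0 - rho) / (1.0 - rhoest))
    sum2 = sum2[:, np.newaxis]         # shape (25, 1), broadcasts against the delta2 columns
    print(sum2.shape)                  # (25, 1)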



Here is an alternative implementation (alt) of your code (orig).

My timeit test shows a 6.8x speed improvement:

    In [52]: %timeit orig()
    1 loops, best of 3: 495 ms per loop

    In [53]: %timeit alt()
    10 loops, best of 3: 72.6 ms per loop



    import numpy as np

    class Bunch(object):
        """ http://code.activestate.com/recipes/52308 """
        def __init__(self, **kwds):
            self.__dict__.update(kwds)

    m, n, p = 10 ** 4, 64, 25

    sparse = Bunch(
        theta1=np.random.random((p, n)),
        theta2=np.random.random((n, p)),
        b1=np.random.random((p, 1)),
        b2=np.random.random((n, 1)),
    )

    x = np.random.random((m, n))
    a3 = np.random.random((m, n))
    a2 = np.random.random((m, p))
    a1 = np.random.random((m, n))
    sum2 = np.random.random((p, ))
    sum2 = sum2[:, np.newaxis]

    def orig():
        partial_j1 = np.zeros(sparse.theta1.shape)
        partial_j2 = np.zeros(sparse.theta2.shape)
        partial_b1 = np.zeros(sparse.b1.shape)
        partial_b2 = np.zeros(sparse.b2.shape)
        delta3t = (-(x - a3) * a3 * (1 - a3)).T
        for i in range(m):
            delta3 = delta3t[:, i:(i + 1)]
            sum1 = np.dot(sparse.theta2.T, delta3)
            delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
            partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
            partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
            partial_b1 += delta2
            partial_b2 += delta3
            # delta3: (64, 1)
            # sum1: (25, 1)
            # delta2: (25, 1)
            # a1[i:(i+1),:]: (1, 64)
            # a2[i:(i+1),:]: (1, 25)
        # partial_j1: (25, 64)
        # partial_j2: (64, 25)
        # partial_b1: (25, 1)
        # partial_b2: (64, 1)
        return partial_j1, partial_j2, partial_b1, partial_b2

    def alt():
        delta3 = (-(x - a3) * a3 * (1 - a3)).T
        sum1 = np.dot(sparse.theta2.T, delta3)
        delta2 = (sum1 + sum2) * a2.T * (1 - a2.T)
        # delta3: (64, 10000)
        # sum1: (25, 10000)
        # delta2: (25, 10000)
        # a1: (10000, 64)
        # a2: (10000, 25)
        partial_j1 = np.dot(delta2, a1)
        partial_j2 = np.dot(delta3, a2)
        partial_b1 = delta2.sum(axis=1)
        partial_b2 = delta3.sum(axis=1)
        return partial_j1, partial_j2, partial_b1, partial_b2

    answer = orig()
    result = alt()
    for a, r in zip(answer, result):
        try:
            assert np.allclose(np.squeeze(a), r)
        except AssertionError:
            print(a.shape)
            print(r.shape)
            raise



Tip: Notice that I left comments recording the shapes of all the intermediate arrays. Knowing the shapes of the arrays helped me understand what your code was doing. The shapes of the arrays can help guide you toward the right NumPy functions to use. Or at least, paying attention to the shapes can help you check whether an operation is sensible. For example, when you compute

 np.dot(A, B) 

and A.shape = (n, m) and B.shape = (m, p), then np.dot(A, B) is an array of shape (n, p).
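A quick check with small made-up shapes:

    import numpy as np

    A = np.random.random((3, 4))    # (n, m) with n=3, m=4
    B = np.random.random((4, 5))    # (m, p) with m=4, p=5
    print(np.dot(A, B).shape)       # (3, 5), that is (n, p)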




It can also help to build your arrays in C_CONTIGUOUS order (at least when using np.dot). Doing so can give as much as a 3x speedup:

Below, x is the same as xf except that x is C_CONTIGUOUS and xf is F_CONTIGUOUS; the same relationship holds for y and yf.

    import numpy as np

    m, n, p = 10 ** 4, 64, 25
    x = np.random.random((n, m))
    xf = np.asarray(x, order='F')

    y = np.random.random((m, n))
    yf = np.asarray(y, order='F')

    assert np.allclose(x, xf)
    assert np.allclose(y, yf)
    assert np.allclose(np.dot(x, y), np.dot(xf, y))
    assert np.allclose(np.dot(x, y), np.dot(xf, yf))
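Using the arrays just defined, the flags attribute confirms each layout:

    print(x.flags['C_CONTIGUOUS'])     # True
    print(xf.flags['C_CONTIGUOUS'])    # False
    print(xf.flags['F_CONTIGUOUS'])    # True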

%timeit shows the difference in speed:

    In [50]: %timeit np.dot(x, y)
    100 loops, best of 3: 12.9 ms per loop

    In [51]: %timeit np.dot(xf, y)
    10 loops, best of 3: 27.7 ms per loop

    In [56]: %timeit np.dot(x, yf)
    10 loops, best of 3: 21.8 ms per loop

    In [53]: %timeit np.dot(xf, yf)
    10 loops, best of 3: 33.3 ms per loop



Regarding benchmarking in Python:

Using the difference between pairs of time.time() calls to compare the speed of code in Python can be misleading. You need to repeat the measurement many times. It is better to disable the automatic garbage collector. It is also important to measure large spans of time (such as at least 10 seconds' worth of repetitions) in order to avoid errors due to poor resolution of the clock timer and to reduce the significance of the time.time call overhead. Instead of writing all that code yourself, Python provides you with the timeit module. I am essentially using it to time the pieces of code here, except that I am calling it through an IPython session. In a question I linked to, the two code snippets differed by 1.7x according to time.time, while benchmarks using timeit showed the snippets running in essentially identical amounts of time.
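A minimal sketch of using the timeit module directly on the two functions from the comparison script above (orig and alt are assumed to be defined in __main__):

    import timeit

    t_orig = timeit.timeit('orig()', setup='from __main__ import orig', number=10)
    t_alt = timeit.timeit('alt()', setup='from __main__ import alt', number=10)
    print(t_orig / 10)    # average seconds per call, loop version
    print(t_alt / 10)     # average seconds per call, vectorized version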

+36
Aug 29 '13 at 17:58

I would start with in-place operations, so as not to allocate new arrays every time:

    partial_j1 += np.dot(delta2, a1[i,:].reshape(1,a1.shape[1]))
    partial_j2 += np.dot(delta3, a2[i,:].reshape(1,a2.shape[1]))
    partial_b1 += delta2
    partial_b2 += delta3
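The difference is easy to demonstrate: a = a + b builds a brand-new array on every pass, while a += b updates the existing buffer in place (a sketch with made-up arrays, independent of the question's variables):

    import numpy as np

    a = np.zeros((25, 64))
    b = np.random.random((25, 64))

    before = id(a)
    a = a + b                 # allocates a new array
    print(id(a) == before)    # False

    before = id(a)
    a += b                    # modifies the existing array in place
    print(id(a) == before)    # True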

You can replace this expression:

 a1[i,:].reshape(1,a1.shape[1]) 

with a simpler and faster one (thanks to Bi Rico):

 a1[i:i+1] 
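Both expressions select the same 1 x 64 row; a quick check with made-up data:

    import numpy as np

    a1 = np.random.random((10, 64))
    i = 3
    print(a1[i, :].reshape(1, a1.shape[1]).shape)    # (1, 64)
    print(a1[i:i+1].shape)                           # (1, 64)
    print(np.array_equal(a1[i, :].reshape(1, a1.shape[1]), a1[i:i+1]))    # True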

In addition, this line:

 sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest)) 

appears to be the same in every iteration of the loop, so you do not need to recompute it.

And as perhaps a minor optimization, you can replace all occurrences of x[i,:] with x[i].
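For a 2-D array the two spellings select the same row:

    import numpy as np

    x = np.random.random((100, 64))
    print(np.array_equal(x[5, :], x[5]))    # True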

Finally, if you can afford to allocate m times more memory, you can follow unutbu's suggestion and vectorize the loop, changing:

    for i in range(m):
        delta3 = -(x[i] - a3[i]) * a3[i] * (1 - a3[i])

into:

 delta3 = -(x-a3)*a3*(1-a3) 

And you can always use Numba to significantly increase speed without vectorizing (and without using more memory).
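A hedged sketch of the Numba route (numba is a separate install; the njit decorator compiles the plain Python loop to machine code, and the variables here are random stand-ins, not the question's exact arrays):

    import numpy as np
    from numba import njit

    @njit
    def all_delta3(x, a3):
        # the same element-by-element loop, but compiled by Numba
        out = np.empty_like(x)
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                out[i, j] = -(x[i, j] - a3[i, j]) * a3[i, j] * (1 - a3[i, j])
        return out

    x = np.random.random((10000, 64))
    a3 = np.random.random((10000, 64))
    delta3 = all_delta3(x, a3)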

+3
Aug 29 '13 at 17:09

The performance difference between numpy and matlab has always frustrated me. In the end it often comes down to the underlying LAPACK libraries. As far as I know, Matlab uses the full ATLAS LAPACK by default, while numpy uses a lighter LAPACK. Matlab reckons people do not care about space and bulk, while numpy reckons they do. There is a similar question with a good answer.
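If you want to see which BLAS/LAPACK libraries your own NumPy build is linked against, NumPy can report it:

    import numpy as np

    np.show_config()    # prints the BLAS/LAPACK libraries this NumPy was built against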

0
Aug 29 '13 at 17:21


