Numpy routines don't seem so fast

I use Python to do Bayesian statistics. I coded the same analysis in Python and in Fortran 95. The Fortran code is faster ... by a factor of about 100. I expected Fortran to be faster, but I really hoped that with NumPy I could get the Python code close, perhaps within a factor of 2. I have profiled the Python code, and it looks like most of the time is spent doing the following things:

scipy.stats.rvs: taking a random draw from a distribution. I do this ~19,000 times, for a total time of 3.552 s.

numpy.slogdet: computing the log of a matrix determinant. I do this ~10,000 times, for a total of 2.480 s.

numpy.solve: solving a linear system. I call this ~10,000 times, for a total time of 2.557 s.

Overall my Python code runs in ~11 seconds, while my Fortran code takes 0.092 seconds. Is this a joke? I'm really not trying to be unrealistic in my expectations of Python, and I certainly don't expect my Python code to be as fast as Fortran .. but 100 times slower? Python should be able to do better than that. Just in case you are curious, here is the complete output of my profiler (I don't know why it breaks the text into several blocks):

1290611 function calls in 11.296 CPU seconds

Ordered by: internal time, function name

ncalls tottime percall cumtime percall filename:lineno(function)
18973 0.864 0.000 3.552 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:484(rvs)
9976 0.819 0.000 2.480 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:1559(slogdet)
9976 0.627 0.000 6.659 0.001 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:77(evaluate_posterior)
9384 0.591 0.000 0.753 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:39(construct_R_matrix)
77852 0.533 0.000 0.533 0.000 :0(array)
37946 0.520 0.000 1.489 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:32(_wrapit)
77851 0.423 0.000 0.956 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:216(asarray)
37946 0.360 0.000 0.360 0.000 :0(all)
9976 0.335 0.000 2.557 0.000 /usr/lib64/python2.6/site-packages/scipy/linalg/basic.py:23(solve)
107799 0.322 0.000 0.322 0.000 :0(len)
109740 0.301 0.000 0.301 0.000 :0(issubclass)
28357 0.294 0.000 0.294 0.000 :0(prod)
9976 0.287 0.000 0.957 0.000 /usr/lib64/python2.6/site-packages/scipy/linalg/lapack.py:45(find_best_lapack_type)
1 0.282 0.282 11.294 11.294 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:199(get_rho_lambda_draws)
9976 0.269 0.000 1.386 0.000 /usr/lib64/python2.6/site-packages/scipy/linalg/lapack.py:60(get_lapack_funcs)
19952 0.263 0.000 0.476 0.000 /usr/lib64/python2.6/site-packages/scipy/linalg/lapack.py:23(cast_to_lapack_prefix)
19952 0.235 0.000 0.669 0.000 /usr/lib64/python2.6/site-packages/numpy/lib/function_base.py:483(asarray_chkfinite)
66833 0.212 0.000 0.212 0.000 :0(log)
18973 0.207 0.000 1.054 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1427(product)
29931 0.205 0.000 0.205 0.000 :0(reduce)
28949 0.187 0.000 0.856 0.000 :0(map)
9976 0.175 0.000 0.175 0.000 :0(dot)
47922 0.163 0.000 0.163 0.000 :0(getattr)
9976 0.157 0.000 0.206 0.000 /usr/lib64/python2.6/site-packages/numpy/lib/twodim_base.py:169(eye)
19952 0.154 0.000 0.271 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:32(loggbeta)
18973 0.151 0.000 0.793 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1548(all)
19953 0.146 0.000 0.146 0.000 :0(any)
9976 0.142 0.000 0.316 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:99(_commonType)
9976 0.133 0.000 0.133 0.000 :0(dgetrf)
18973 0.125 0.000 0.175 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:462(_fix_loc_scale)
39904 0.117 0.000 0.117 0.000 :0(append)
18973 0.105 0.000 0.292 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1461(alltrue)
19952 0.102 0.000 0.102 0.000 :0(zeros)
19952 0.093 0.000 0.154 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:71(isComplexType)
19952 0.090 0.000 0.090 0.000 :0(split)
9976 0.089 0.000 2.569 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:62(get_log_determinant_of_matrix)
19952 0.087 0.000 0.134 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:35(logggamma)
9976 0.083 0.000 0.154 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:139(_fastCopyAndTranspose)
9976 0.076 0.000 0.125 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:157(_assertSquareness)
9976 0.074 0.000 0.097 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:151(_assertRank2)
9976 0.072 0.000 0.119 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:127(_to_native_byte_order)
18973 0.072 0.000 0.072 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:832(_argcheck)
9976 0.072 0.000 0.228 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:901(diagonal)
9976 0.070 0.000 0.070 0.000 :0(arange)
9976 0.061 0.000 0.061 0.000 :0(diagonal)
9976 0.055 0.000 0.055 0.000 :0(sum)
9976 0.053 0.000 0.075 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:84(_realType)
11996 0.050 0.000 0.091 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:1412(_rvs)
9384 0.047 0.000 0.162 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1898(prod)
9976 0.045 0.000 0.045 0.000 :0(sort)
11996 0.041 0.000 0.041 0.000 :0(standard_normal)
9976 0.037 0.000 0.037 0.000 :0(_fastCopyAndTranspose)
9976 0.037 0.000 0.037 0.000 :0(hasattr)
9976 0.037 0.000 0.037 0.000 :0(range)
6977 0.034 0.000 0.055 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:3731(_rvs)
9977 0.027 0.000 0.027 0.000 :0(max)
9976 0.023 0.000 0.023 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:498(isfortran)
9977 0.022 0.000 0.022 0.000 :0(min)
9976 0.022 0.000 0.022 0.000 :0(get)
6977 0.021 0.000 0.021 0.000 :0(uniform)
1 0.001 0.001 11.295 11.295 <string>:1(<module>)
1 0.001 0.001 11.296 11.296 profile:0(get_rho_lambda_draws(correlations,energies,rho_priors,lambda_e_prior,lambda_z_prior,candidate_sig2_rhos,candidate_sig2_lambda_e,candidate_sig2_lambda_z,3000))
2 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:445(__call__)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:385(__init__)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:175(_array2string)
2 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:475(_digits)
2 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:309(_extendLine)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:317(_formatArray)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1477(any)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:243(array2string)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:1390(array_str)
1 0.000 0.000 0.000 0.000 :0(compress)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:394(fillFormat)
6 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:2166(geterr)
12 0.000 0.000 0.000 0.000 :0(geterrobj)
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1043(ravel)
1 0.000 0.000 0.000 0.000 :0(ravel)
8 0.000 0.000 0.000 0.000 :0(rstrip)
6 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:2070(seterr)
6 0.000 0.000 0.000 0.000 :0(seterrobj)
1 0.000 0.000 0.000 0.000 :0(setprofile)

EDIT:

Here is a copy of the relevant routines

# Imports assumed at the top of ce_funcs.py; the helpers loggbeta, logggamma and
# get_log_determinant_of_matrix are defined elsewhere in that file.
from numpy import array, eye, dot, prod, log, newaxis
from scipy.linalg import solve
from scipy.stats import norm, uniform


def get_rho_lambda_draws(correlations, energies, rho_priors, lam_e_prior, lam_z_prior,
                         candidate_sig2_rhos, candidate_sig2_lambda_e,
                         candidate_sig2_lambda_z, ndraws):
    nBasis = len(correlations[0])
    nStruct = len(correlations)
    rho_draws = [[0.5 for x in xrange(nBasis)] for y in xrange(ndraws)]
    lambda_e_draws = [5 for x in xrange(ndraws)]
    lambda_z_draws = [5 for x in xrange(ndraws)]
    accept_rhos = array([0. for x in xrange(nBasis)])
    accept_lambda_e = 0.
    accept_lambda_z = 0.

    for i in xrange(1, ndraws):
        if i % 100 == 0:
            print i, "REP<---------------------------------------------------------------------------------"

        # do metropolis to get rho
        rho_draws[i] = [x for x in rho_draws[i-1]]
        lambda_e_draws[i] = lambda_e_draws[i-1]
        lambda_z_draws[i] = lambda_z_draws[i-1]
        rho_vec = [x for x in rho_draws[i-1]]

        R_matrix_before = construct_R_matrix(correlations, correlations, rho_vec)
        post_before = evaluate_posterior(R_matrix_before, rho_vec, energies,
                                         lambda_e_draws[i-1], lambda_z_draws[i-1],
                                         lam_e_prior, lam_z_prior, rho_priors)
        index = 0
        for j in xrange(nBasis):
            cand = norm.rvs(rho_draws[i-1][j], scale=candidate_sig2_rhos[j])
            if 0.0 < cand < 1.0:
                rho_vec[j] = cand
                R_matrix_after = construct_R_matrix(correlations, correlations, rho_vec)
                post_after = evaluate_posterior(R_matrix_after, rho_vec, energies,
                                                lambda_e_draws[i-1], lambda_z_draws[i-1],
                                                lam_e_prior, lam_z_prior, rho_priors)
                metrop_value = post_after - post_before
                unif = log(uniform.rvs(0, 1))
                if metrop_value > unif:
                    rho_draws[i][j] = cand
                    post_before = post_after
                    accept_rhos[j] += 1
                else:
                    rho_vec[j] = rho_draws[i-1][j]

        R_matrix = construct_R_matrix(correlations, correlations, rho_vec)

        cand = norm.rvs(lambda_e_draws[i-1], scale=candidate_sig2_lambda_e)
        if cand > 0.0:
            post_after = evaluate_posterior(R_matrix, rho_vec, energies, cand,
                                            lambda_z_draws[i-1], lam_e_prior,
                                            lam_z_prior, rho_priors)
            metrop_value = post_after - post_before
            unif = log(uniform.rvs(0, 1))
            if metrop_value > unif:
                lambda_e_draws[i] = cand
                post_before = post_after
                accept_lambda_e = accept_lambda_e + 1

        cand = norm.rvs(lambda_z_draws[i-1], scale=candidate_sig2_lambda_z)
        if cand > 0.0:
            post_after = evaluate_posterior(R_matrix, rho_vec, energies,
                                            lambda_e_draws[i], cand, lam_e_prior,
                                            lam_z_prior, rho_priors)
            metrop_value = post_after - post_before
            unif = log(uniform.rvs(0, 1))
            if metrop_value > unif:
                lambda_z_draws[i] = cand
                post_before = post_after
                accept_lambda_z = accept_lambda_z + 1

    print accept_rhos / ndraws
    print accept_lambda_e / ndraws
    print accept_lambda_z / ndraws
    return [rho_draws, lambda_e_draws, lambda_z_draws]


def evaluate_posterior(R_matrix, rho_vec, energies, lambda_e, lambda_z,
                       lam_e_prior, lam_z_prior, rho_prior_params):
    # from scipy.linalg import solve
    # from numpy import allclose
    working_matrix = eye(len(R_matrix))/lambda_e + R_matrix/lambda_z
    logdet = get_log_determinant_of_matrix(working_matrix)
    x = solve(working_matrix, energies, sym_pos=True)
    # if not allclose(dot(working_matrix, x), energies):
    #     exit('solve routine didnt work')
    rho_priors = sum([loggbeta(rho_vec[j], rho_prior_params[j][0], rho_prior_params[j][1])
                      for j in xrange(len(rho_vec))])
    loggposterior = (-.5*logdet - .5*dot(energies, x)
                     + logggamma(lambda_e, lam_e_prior[0], lam_e_prior[1])
                     + logggamma(lambda_z, lam_z_prior[0], lam_z_prior[1])
                     + rho_priors)
    # (a_e-1)*log(lambda_e) - b_e*lambda_e + (a_z-1)*log(lambda_z) - b_z*lambda_z + rho_priors
    return loggposterior


def construct_R_matrix(listone, listtwo, rhos):
    return prod(rhos[:]**(4*(listone[:, newaxis] - listtwo)**2), axis=2)

(Once again, I don't know why it breaks my post into several blocks when I submit it. Hopefully you can make sense of it.)

+6
6 answers

It's hard to tell exactly what is going on in your code, but my suspicion is that it just isn't (or can't be) very vectorized, because obviously calling .rvs() 19,000 times, one draw at a time, is going to be much slower than a single .rvs(size=19000). See:

In [5]: %timeit x = [scipy.stats.norm().rvs() for i in range(19000)]
1 loops, best of 3: 1.23 s per loop

In [6]: %timeit x = scipy.stats.norm().rvs(size=19000)
1000 loops, best of 3: 1.67 ms per loop

So if your code or algorithm genuinely can't be vectorized, it is expected to be slower than Fortran.
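One way to keep a Metropolis-style loop but still benefit from vectorized sampling is to pre-draw all of the proposal noise and acceptance uniforms in a couple of calls up front. A self-contained toy sketch (not the OP's model; the target, step size, and sizes are invented for illustration):

import numpy as np
from scipy import stats

ndraws = 19000
# All random numbers drawn up front in two vectorized calls,
# instead of one .rvs() call per iteration.
noise = stats.norm.rvs(size=ndraws, scale=0.5)   # proposal steps
log_u = np.log(stats.uniform.rvs(size=ndraws))   # acceptance draws

draws = np.empty(ndraws)
draws[0] = 0.0
for i in range(1, ndraws):
    cand = draws[i - 1] + noise[i]
    # log acceptance ratio for a standard normal target
    if -0.5 * (cand**2 - draws[i - 1]**2) > log_u[i]:
        draws[i] = cand
    else:
        draws[i] = draws[i - 1]

The loop body is still plain Python, but the tens of thousands of separate scipy.stats calls collapse into two.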

+6

Check out the performance page put together by the SciPy/NumPy people. There are a number of remarkably simple techniques there that can give you very fast code, among them: (a) using the weave module, especially its inline and blitz options; (b) using Cython to write some of your functions in C while still being able to call and use them from Python.
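As a flavor of what weave.inline looks like, here is a minimal, untested sketch (weave is Python 2-era and was later removed from SciPy; the variables are made up):

from scipy import weave  # Python 2 only; no longer shipped with modern SciPy

a, b = 3, 4
# The C snippet sees the Python ints as C ints; whatever is assigned to
# return_val is handed back to Python.
print weave.inline('return_val = a * b;', ['a', 'b'])   # prints 12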

I do a lot of large-scale scientific computing in Python for statistics, finance, and (back in grad school) computer vision. The reason Python is great for these kinds of problems is not that my naive first-hack code will give the fastest solution, but that in Python I can easily interface with so many other things. I can easily issue commands to other Linux programs, easily read and parse most data files, easily talk to SQL and other database software; I have the entire R statistics library available, can use OpenCV (with much nicer syntax than the C++ version), and much more.

When the bulk of my task is wrangling a new dataset and getting my hands dirty, getting a feel for its nuances, Python's ease of programming, together with matplotlib, makes that much better. Later, when I need to scale things up, I can always use PyCUDA, Cython, or just rewrite things in C++ if high performance is required. Since most machines now have multicore processors, the multiprocessing module, as well as mpi4py, lets me quickly and cheaply turn annoying loop-style tasks into much shorter ones without having to switch to C++.

In short, the real usefulness of Python comes not from the language itself but from knowing the add-ons and extensions that let you cheaply make your small set of recurring bottlenecks run fast on the data that matters to you day to day.

Real-time embedded communications software will be written in C++ for a long time to come ... likewise high-frequency trading strategies. But then again, professional solutions to those kinds of problems were never really what Python was aimed at, and in some cases people prefer more exotic solutions for that stuff anyway.

+5

Get rid of the two loops and the two lists by replacing them with NumPy functions and constructs that operate on numpy.ndarray objects. Also, don't print between calculations — that is slow too. You can probably get a 10-50x speedup by following the tips above.
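To illustrate the first point, here is a small sketch (the sizes are invented) of replacing the nested list comprehensions in get_rho_lambda_draws with preallocated arrays:

import numpy as np

ndraws, nBasis = 3000, 10                      # assumed sizes for illustration

# Nested Python lists, as in the original code:
rho_draws_list = [[0.5 for x in range(nBasis)] for y in range(ndraws)]

# Preallocated ndarrays instead:
rho_draws = 0.5 * np.ones((ndraws, nBasis))
lambda_e_draws = 5.0 * np.ones(ndraws)

# Whole-row operations replace the inner Python loops:
rho_draws[1] = rho_draws[0]                    # copy the previous draw in one step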

Also see http://www.scipy.org/PerformancePython/

+1

Usually you should not use NumPy or SciPy for single scalar values — use "normal" Python there (normalvariate below is random.normalvariate from the standard library). Extending the example provided by @sega_sai:

In [11]: %timeit x = [normalvariate(0, 1) for i in range(190)]
1000 loops, best of 3: 274 µs per loop

In [12]: %timeit x = [scipy.stats.norm().rvs() for i in range(190)]
10 loops, best of 3: 180 ms per loop

In [13]: %timeit x = scipy.stats.norm().rvs(size=190)
1000 loops, best of 3: 987 µs per loop

It gets faster if you create the scipy.stats.norm().rvs instance once and reuse it:

In [14]: rvs = scipy.stats.norm().rvs

In [15]: %timeit x = [rvs() for i in range(190)]
100 loops, best of 3: 3.8 ms per loop

In [16]: %timeit x = rvs(size=190)
10000 loops, best of 3: 44 µs per loop

Also note that the PyMC paper has this complaint about SciPy's probability distributions:

"Based on an informal comparison using version 2.0, distributions in PyMC are typically about an order of magnitude faster than their counterparts in SciPy (using version 0.7)."

http://www.map.ox.ac.uk/media/PDF/Patil_et_al_2010.pdf

import pymc

s = pymc.Normal('s', 0, 1)

%timeit x = [s.rand() for i in range(190)]
100 loops, best of 3: 3.76 ms per loop

For comparison, here is SciPy again, drawing one value at a time on each iteration:

generate = scipy.stats.norm().rvs

%timeit x = [generate() for i in range(190)]
100 loops, best of 3: 7.98 ms per loop
+1

Try this:

import psyco
psyco.full()

Or try PyPy; both can sometimes give significant speed improvements, although PyPy does not yet have full NumPy support.

0

I recently posted something on Stack Overflow about C/C++/Fortran performance compared with Python:

comparing python with c / fortran

What I concluded in that post is that it is better to combine Python with a low-level programming language than to use Python itself for numerical computation. I actually use F2PY.
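For readers who haven't used it, here is a rough, untested sketch of the F2PY workflow (the module name fastmod, the toy dot_prod subroutine, and the file name are all made up for illustration; it assumes NumPy's f2py and a Fortran compiler are on the path):

import subprocess
import numpy as np

# A toy Fortran routine; intent(out) arguments become return values in Python.
fortran_src = """
subroutine dot_prod(a, b, n, res)
  integer, intent(in) :: n
  double precision, intent(in) :: a(n), b(n)
  double precision, intent(out) :: res
  res = sum(a * b)
end subroutine dot_prod
"""

with open("fastmod.f90", "w") as f:
    f.write(fortran_src)

# Compile the Fortran source into a Python extension module named fastmod.
subprocess.check_call(["f2py", "-c", "-m", "fastmod", "fastmod.f90"])

import fastmod
print(fastmod.dot_prod(np.arange(3.0), np.arange(3.0)))   # 0*0 + 1*1 + 2*2 = 5.0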

0

Source: https://habr.com/ru/post/911153/

