Since this question was asked (and answered), a lot has happened: numexpr, numba and cython have appeared. The aim of this answer is to take these possibilities into account.
But first let me state the obvious: no matter how you map a Python function to a numpy array, it remains a Python function, which means for each evaluation:
- the numpy-array element must be converted to a Python object (e.g. a `float`),
- all calculations are done with Python objects, which means the overhead of the interpreter, dynamic dispatch and immutable objects.
Thus, because of these costs, the machinery used to loop through the array doesn't play a big role; it stays much slower than plain numpy vectorization, as the small timing sketch below illustrates.
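A minimal sketch of that per-element cost (my own illustration, not part of the benchmarks in this answer; the squaring function and the array size are arbitrary choices):

```python
import numpy as np
import timeit

x = np.random.rand(10**5)

# each xx is unboxed into a Python float, multiplied as a Python object,
# and the result is boxed again; numpy does the same work in one C loop
loop = lambda: np.array([xx * xx for xx in x])
vect = lambda: x * x

print(timeit.timeit(loop, number=10))  # Python-object arithmetic per element
print(timeit.timeit(vect, number=10))  # single vectorized C loop
```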
Let's look at the following example:
```python
import numpy as np

# numpy-functionality
def f(x):
    return x + 2*x*x + 4*x*x*x
```
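The pure-Python competitor `vf` referenced in the appendix is just the np.vectorize wrapper of f (a short sketch; setting `__name__` is only so that perfplot, which labels kernels by function name, shows a readable legend):

```python
# np.vectorize is a convenience loop over Python-level calls, not a speed-up
vf = np.vectorize(f)
vf.__name__ = "vf"  # readable label in the perfplot legend
```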
This np.vectorize wrapper serves as the representative of the pure-Python class of approaches. Using perfplot (see the code in the appendix of this answer) we get the following running times:
[plot: running times of the numpy version f vs. the pure-Python vf]
We see that the numpy approach is 10x-100x faster than the pure-Python version. The drop in performance for larger array sizes is probably because the data no longer fits the cache.
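A back-of-the-envelope check of that cache claim (my arithmetic, not from the original benchmarks):

```python
# at n = 2**21 the input alone is 16 MiB of float64s, and the vectorized
# expression allocates several temporaries of the same size, so the
# working set quickly exceeds a typical L3 cache of a few MiB
n = 2**21
print(n * 8 / 2**20)  # 16.0 (MiB)
```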
It is often said that numpy's performance is as good as it gets, because it is pure C under the hood. Yet there is a lot of room for improvement!

The vectorized numpy version uses a lot of additional memory and memory accesses. The numexpr library tries to tile the numpy arrays and thus get a better cache utilization:
```python
# fewer cache misses than the numpy-functionality
import numexpr as ne

def ne_f(x):
    return ne.evaluate("x+2*x*x+4*x*x*x")
```
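One note on usage (hedged: `local_dict` is part of numexpr's public API, but the check below is my own): by default ne.evaluate looks the variables of the expression string up in the caller's frame, though they can also be passed explicitly:

```python
import numpy as np
import numexpr as ne

x = np.random.rand(1000)
# pass the variables explicitly instead of relying on frame lookup
y = ne.evaluate("x+2*x*x+4*x*x*x", local_dict={"x": x})
np.testing.assert_allclose(y, x + 2*x*x + 4*x*x*x)
```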
This leads to the following comparison:
[plot: numexpr added to the comparison]
I can't explain everything in the plot above: we see a bigger overhead for the numexpr library at the beginning, but, because it utilizes the cache better, it is about 10x faster for larger arrays!
Another approach is to jit-compile the function and thus get a real pure-C UFunc. This is numba's approach:
```python
# runtime-generated C-function as ufunc
import numba as nb

@nb.vectorize(target="cpu")
def nb_vf(x):
    return x + 2*x*x + 4*x*x*x
```
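Because @nb.vectorize produces a real ufunc, nb_vf broadcasts like any numpy ufunc and accepts scalars as well (a quick check of my own, not from the original answer):

```python
import numpy as np

x = np.random.rand(1000)
np.testing.assert_allclose(nb_vf(x), x + 2*x*x + 4*x*x*x)
print(nb_vf(2.0))  # scalar input works too: 2 + 8 + 32 = 42.0
```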
It is 10x faster than the original numpy approach:
[plot: numba's nb_vf added to the comparison]
However, the task is embarrassingly parallelizable, so we could also use prange to compute the loop in parallel:
```python
@nb.njit(parallel=True)
def nb_par_jitf(x):
    y = np.empty(x.shape)
    for i in nb.prange(len(x)):
        y[i] = x[i] + 2*x[i]*x[i] + 4*x[i]*x[i]*x[i]
    return y
```
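A benchmarking caveat (my addition): the first call of a jit-compiled function includes the compilation itself, so it should be warmed up before timing:

```python
import numpy as np

x = np.random.rand(1000)
nb_par_jitf(x)  # first call triggers compilation; later calls run the machine code
```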
As expected, the parallel version is slower for smaller inputs, but faster (almost a factor of 2) for larger sizes:
[plot: serial vs. parallel numba version]
While numba specializes in accelerating operations on numpy arrays, Cython is a more general tool. Extracting the same performance as with numba is more complicated; it often comes down to llvm (numba) versus the local compiler (gcc/MSVC):
```python
%%cython -c=/openmp -a
import numpy as np
import cython

# single core:
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_f(double[::1] x):
    y_out = np.empty(len(x))
    cdef Py_ssize_t i
    cdef double[::1] y = y_out
    for i in range(len(x)):
        y[i] = x[i] + 2*x[i]*x[i] + 4*x[i]*x[i]*x[i]
    return y_out

# parallel:
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_par_f(double[::1] x):
    y_out = np.empty(len(x))
    cdef double[::1] y = y_out
    cdef Py_ssize_t i
    cdef Py_ssize_t n = len(x)
    for i in prange(n, nogil=True):
        y[i] = x[i] + 2*x[i]*x[i] + 4*x[i]*x[i]*x[i]
    return y_out
```
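Two practical notes on the cell above (my additions, not from the original answer): %%cython requires `%load_ext Cython` in a Jupyter session first, and `-c=/openmp` is the MSVC flag; with gcc/clang one would use `-c=-fopenmp --link-args=-fopenmp` instead. Once compiled, the typed-memoryview signature accepts any C-contiguous float64 array:

```python
import numpy as np

x = np.random.rand(1000)  # C-contiguous float64, as double[::1] requires
np.testing.assert_allclose(cy_f(x), f(x))
np.testing.assert_allclose(cy_par_f(x), f(x))
```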
Cython leads to slightly slower functions:
[plot: Cython versions added to the comparison]
Conclusion
Obviously, testing only one function doesn't prove anything. Also, one should keep in mind that for the chosen example function the memory bandwidth was the bottleneck for sizes larger than 10^5 elements, which is why numba, numexpr and cython showed the same performance in that region.
However, based on this investigation and my experience so far, I would say that numba seems to be the easiest tool with the best performance.
Code for creating the plots with the perfplot package:
```python
import perfplot

perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2**k for k in range(0, 24)],
    kernels=[
        f,
        vf,
        ne_f,
        nb_vf,
        nb_par_jitf,
        cy_f,
        cy_par_f,
    ],
    logx=True,
    logy=True,
    xlabel='len(x)',
)
```
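Newer versions of perfplot also expose perfplot.bench, which returns the measurements for later display or saving (hedged: the exact API depends on the installed version; by default perfplot also verifies that all kernels return equal results before timing them):

```python
import numpy as np
import perfplot

out = perfplot.bench(
    setup=lambda n: np.random.rand(n),
    kernels=[f, vf, ne_f, nb_vf, nb_par_jitf, cy_f, cy_par_f],
    n_range=[2**k for k in range(0, 24)],
)
out.show()
```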