Cython is either slightly faster or slower than pure Python

Question

Cython is either slightly faster or slower than pure Python

I use several methods ( NumPy , Weave and Cython ) to run a Python performance test. Mathematically, mathematically, the code has the value C = AB , where A, B, and C are N x N matrices ( NOTE: this is a matrix product, not an element multiplication).

I wrote 5 different code implementations:

Pure Python (Loop over 2D Python lists)
NumPy (point product of 2D NumPy arrays)
Weaving inline (C ++ loop over two-dimensional arrays)
Cython (Loop over 2D Python lists + static typing)
Cython-Numpy (Loop over 2D NumPy Arrays + Static Typing)

My expectation is that implementations 2 through 5 will be significantly faster than implementation 1. My results, however, point to something else. These are my normalized acceleration results for a relatively clean Python implementation:

python_list: 1.00
numpy_array: 330.09
weave_inline: 30.72
cython_list: 2.80
cython_array: 0.14

I am very pleased with NumPy's performance, however I am less enthusiastic about Weave's performance and Cython's performance makes me cry. All my code is split into two files. Everything is automated, and you just need to run the first file to see all the results. Can someone help me by pointing out what I can do to get the best results?

matmul.py:

 import time import numpy as np from scipy import weave from scipy.weave import converters import pyximport pyximport.install() import cython_matmul as cml def python_list_matmul(A, B): C = np.zeros(A.shape, dtype=float).tolist() A = A.tolist() B = B.tolist() for k in xrange(len(A)): for i in xrange(len(A)): for j in xrange(len(A)): C[i][k] += A[i][j] * B[j][k] return C def numpy_array_matmul(A, B): return np.dot(A, B) def weave_inline_matmul(A, B): code = """ int i, j, k; for (k = 0; k < N; ++k) { for (i = 0; i < N; ++i) { for (j = 0; j < N; ++j) { C(i, k) += A(i, j) * B(j, k); } } } """ C = np.zeros(A.shape, dtype=float) weave.inline(code, ['A', 'B', 'C', 'N'], type_converters=converters.blitz, compiler='gcc') return C N = 100 A = np.random.rand(N, N) B = np.random.rand(N, N) function = [] function.append([python_list_matmul, 'python_list']) function.append([numpy_array_matmul, 'numpy_array']) function.append([weave_inline_matmul, 'weave_inline']) function.append([cml.cython_list_matmul, 'cython_list']) function.append([cml.cython_array_matmul, 'cython_array']) t = [] for i in xrange(len(function)): t1 = time.time() C = function[i][0](A, B) t2 = time.time() t.append(t2 - t1) print function[i][1] + ' \t: ' + '{:10.6f}'.format(t[0] / t[-1])

cython_matmul.pyx:

 import numpy as np cimport numpy as np import cython cimport cython DTYPE = np.float ctypedef np.float_t DTYPE_t @cython.boundscheck(False) @cython.wraparound(False) @cython.nonecheck(False) cpdef cython_list_matmul(A, B): cdef int i, j, k cdef int N = len(A) A = A.tolist() B = B.tolist() C = np.zeros([N, N]).tolist() for k in xrange(N): for i in xrange(N): for j in xrange(N): C[i][k] += A[i][j] * B[j][k] return C @cython.boundscheck(False) @cython.wraparound(False) @cython.nonecheck(False) cpdef cython_array_matmul(np.ndarray[DTYPE_t, ndim=2] A, np.ndarray[DTYPE_t, ndim=2] B): cdef int i, j, k, N = A.shape[0] cdef np.ndarray[DTYPE_t, ndim=2] C = np.zeros([N, N], dtype=DTYPE) for k in xrange(N): for i in xrange(N): for j in xrange(N): C[i][k] += A[i][j] * B[j][k] return C

+4

python python-2.7 numpy scipy cython

Aeronaelius Jun 10 '13 at 15:29

source share

1 answer

Nikita Nemkin · Accepted Answer · 2013-06-10T17:08:48+0000

Python lists and high-performance maths are incompatible, forget about cython_list_matmul .

The only problem with your cython_array_matmul is the misuse of indexing. It should be

 C[i,k] += A[i,j] * B[j,k]

The way numpy arrays are indexed in Python and what Cython syntax optimizes. With this change, you should get decent performance.

The Cython annotation function really helps in identifying optimization problems like this. You may have noticed that A[i][j] creates a ton of Python API calls, but A[i,j] does not.

Also, if you initialize all entries manually, np.empty more suitable than np.zeros .

Cython is either slightly faster or slower than pure Python

More articles: