I am using Cython for the first time to get some speed into a function. The function takes a square matrix A of floating point numbers and outputs a single floating point number: the permanent of the matrix.

When A is 30 by 30, my code currently takes about 60 seconds on my PC.
In the code below I have implemented the Balasubramanian-Bax-Franklin-Glynn formula for the permanent, as given on the Wikipedia page for the permanent. I have called the matrix M.
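For reference, the formula sums over all 2^(n-1) sign vectors with the first sign fixed at +1. Here is a deliberately naive transcription of it that can be used to check results on small matrices; the helper name glynn_bruteforce is just for illustration and is not part of my module:

    from itertools import product
    import numpy as np

    def glynn_bruteforce(M):
        """Permanent via the Glynn formula, enumerating every sign vector directly."""
        n = M.shape[0]
        total = 0.0
        # The first sign is fixed at +1; the remaining n-1 signs run over +-1.
        for rest in product((1.0, -1.0), repeat=n - 1):
            delta = np.array((1.0,) + rest)
            sign = np.prod(delta)                # product of the signs
            total += sign * np.prod(delta @ M)   # product of the sign-weighted column sums
        return total / 2 ** (n - 1)

This is exponentially slow, of course; it is only there to convince myself that the fast version below is correct.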
One tricky part of the code is the array f, which is used to store the index of the next position to flip in the array d. The array d holds values that are only ever +1 or -1. The manipulation of f and j in the loop is just a clever way of updating a Gray code quickly.
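To make that bookkeeping concrete, here is a small standalone sketch (illustration only, not part of my module) that runs just the f/j updates and records which position gets flipped at each step; it reproduces the binary-reflected Gray code flip order over the first n-1 positions of d:

    import numpy as np

    def gray_flip_sequence(n):
        """Return the order in which positions of d are flipped by the main loop."""
        f = np.arange(n)
        j = 0
        flips = []
        while j < n - 1:
            flips.append(j)       # in permfunc, d[j] is negated and v is updated here
            f[0] = 0
            f[j] = f[j + 1]
            f[j + 1] = j + 1
            j = f[0]
        return flips

    print(gray_flip_sequence(4))  # [0, 1, 0, 2, 0, 1, 0]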
    from __future__ import division
    import numpy as np
    cimport numpy as np
    cimport cython

    DTYPE_int = np.int
    ctypedef np.int_t DTYPE_int_t

    DTYPE_float = np.float64
    ctypedef np.float64_t DTYPE_float_t

    @cython.boundscheck(False) # turn off bounds-checking for entire function
    @cython.wraparound(False)  # turn off negative index wrapping for entire function
    def permfunc(np.ndarray [DTYPE_float_t, ndim=2, mode='c'] M):
        cdef int n = M.shape[0]
        cdef np.ndarray[DTYPE_float_t, ndim=1, mode='c'] d = np.ones(n, dtype=DTYPE_float)
        cdef int j = 0
        cdef int s = 1
        cdef np.ndarray[DTYPE_int_t, ndim=1, mode='c'] f = np.arange(n, dtype=DTYPE_int)
        cdef np.ndarray[DTYPE_float_t, ndim=1, mode='c'] v = M.sum(axis=0)
        cdef DTYPE_float_t p = 1
        cdef int i
        cdef DTYPE_float_t prod
        for i in range(n):
            p *= v[i]
        while (j < n-1):
            for i in range(n):
                v[i] -= 2*d[j]*M[j, i]
            d[j] = -d[j]
            s = -s
            prod = 1
            for i in range(n):
                prod *= v[i]
            p += s*prod
            f[0] = 0
            f[j] = f[j+1]
            f[j+1] = j+1
            j = f[0]
        return p/2**(n-1)
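As a quick sanity check that the compiled function is correct (assuming the code above is saved as permlib.pyx, matching the import used in the timings further down):

    import numpy as np
    import pyximport; pyximport.install()
    import permlib

    # For a 2x2 matrix [[a, b], [c, d]] the permanent is a*d + b*c.
    M = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    print(permlib.permfunc(M))    # expected: 1*4 + 2*3 = 10.0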
I have used all the simple optimizations I found in the Cython tutorial. Some aspects, I have to admit, I don't fully understand. For example, if I make the array d ints, since its values are only ever +1 or -1, the code runs about 10% slower, so I left it as float64.
Is there anything else I can do to speed up the code?
This is the result of cython -a. As you can see, everything in the loop is compiled to C, so the basic optimizations have worked.

Here is the same function in numpy, which is more than 100 times slower than my current Cython version.
    def npperm(M):
        n = M.shape[0]
        d = np.ones(n)
        j = 0
        s = 1
        f = np.arange(n)
        v = M.sum(axis=0)
        p = np.prod(v)
        while (j < n-1):
            v -= 2*d[j]*M[j]
            d[j] = -d[j]
            s = -s
            prod = np.prod(v)
            p += s*prod
            f[0] = 0
            f[j] = f[j+1]
            f[j+1] = j+1
            j = f[0]
        return p/2**(n-1)
Update: timings
Here are the timings (using IPython) for my Cython version, the numpy version, and an improved version of the Cython code. I have set the seed for reproducibility.
    from scipy.stats import ortho_group
    import pyximport; pyximport.install()
    import permlib
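The benchmark calls themselves look like this in the IPython session (the matrix size and seed shown here are placeholders rather than the exact values used, and the improved Cython version is timed in the same way):

    import numpy as np
    np.random.seed(7)                 # placeholder seed
    M = ortho_group.rvs(23)           # placeholder size; random orthogonal test matrix
    %timeit permlib.permfunc(M)       # my Cython version
    %timeit npperm(M)                 # the numpy version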
Can the Cython code be sped up at all?
I am using gcc and the CPU is AMD FX 8350.