Cython starter is here. I am trying to speed up the calculation of certain pair statistics (in several cells) using several threads. In particular, I use prange from cython.parallel, which internally uses openMP.
The following minimal example illustrates the problem (compiling with Jupyter notebook Cython magic).
Notebook setup:
%load_ext Cython
import numpy as np
Cython code:
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):
cdef:
int N = X.shape[0]
int nbins = bins.shape[0]
double Xij,Yij
double[:] Z = np.zeros(nbins,dtype=np.float64)
int i,j,b
with nogil, parallel(num_threads=num_threads):
for i in prange(N,schedule='static',chunksize=1):
for j in range(i):
#some pairwise quantities
Xij = X[i]-X[j]
Yij = 0.5*(X[i]+X[j])
#check if in bin
for b in range(nbins):
if (Xij < bins[b,0]) or (Xij > bins[b,1]):
continue
Z[b] += Xij*Yij
return np.asarray(Z)
mock data and bin
X = np.random.rand(10000)
bin_edges = np.linspace(0.,1,11)
bins = np.array([bin_edges[:-1],bin_edges[1:]]).T
bins = bins.copy(order='C')
Terms through
%timeit my_parallel_statistic(X,bins,1)
%timeit my_parallel_statistic(X,bins,4)
gives
1 loop, best of 3: 728 ms per loop
1 loop, best of 3: 330 ms per loop
which is not ideal scaling, but this is not the main point of the question. (But let me know if you have suggestions other than adding regular decorators or tweaking the prange arguments.)
However, this calculation is apparently not thread safe:
Z1 = my_parallel_statistic(X,bins,1)
Z4 = my_parallel_statistic(X,bins,4)
np.allclose(Z1,Z4)
shows a significant difference between the two results (up to 20% in this example).
, ,
Z[b] += Xij*Yij
. , , .
Xij Yij , . , Xij Yij , , N , 100 000 x 100 000 numpy ( !).
( ):
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
OS: Red Hat Linux v6.8
Memory: 16 GB