Cython: make prange parallelization thread safe

Cython starter is here. I am trying to speed up the calculation of certain pair statistics (in several cells) using several threads. In particular, I use prange from cython.parallel, which internally uses openMP.

The following minimal example illustrates the problem (compiling with Jupyter notebook Cython magic).

Notebook setup:

%load_ext Cython
import numpy as np

Cython code:

%%cython --compile-args=-fopenmp --link-args=-fopenmp -a

from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel

@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):

    cdef: 
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij,Yij
        double[:] Z = np.zeros(nbins,dtype=np.float64)
        int i,j,b

    with nogil, parallel(num_threads=num_threads):
        for i in prange(N,schedule='static',chunksize=1):
            for j in range(i):
                #some pairwise quantities
                Xij = X[i]-X[j]
                Yij = 0.5*(X[i]+X[j])
                #check if in bin
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z[b] += Xij*Yij

    return np.asarray(Z)

mock data and bin

X = np.random.rand(10000)
bin_edges = np.linspace(0.,1,11)
bins = np.array([bin_edges[:-1],bin_edges[1:]]).T
bins = bins.copy(order='C')

Terms through

%timeit my_parallel_statistic(X,bins,1)
%timeit my_parallel_statistic(X,bins,4)

gives

1 loop, best of 3: 728 ms per loop
1 loop, best of 3: 330 ms per loop

which is not ideal scaling, but this is not the main point of the question. (But let me know if you have suggestions other than adding regular decorators or tweaking the prange arguments.)

However, this calculation is apparently not thread safe:

Z1 = my_parallel_statistic(X,bins,1)
Z4 = my_parallel_statistic(X,bins,4)
np.allclose(Z1,Z4)

shows a significant difference between the two results (up to 20% in this example).

, ,

Z[b] += Xij*Yij

. , , .

Xij Yij , . , Xij Yij , , N , 100 000 x 100 000 numpy ( !).

( ​​ ):

CPU(s): 8
Model name: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
OS: Red Hat Linux v6.8
Memory: 16 GB
+4
2

, Z[b] += Xij*Yij .

atomic critical. Cython , - Z.

, . () . malloc 'd, np. . (num_threads, nbins) , - . , .

numpy "2D" . , 64 , . . .

%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
cimport openmp

@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):

    cdef: 
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij,Yij
        # pad local data to 64 byte avoid false sharing of cache-lines
        int nbins_padded = (((nbins - 1) // 8) + 1) * 8
        double[:] Z_local = np.zeros(nbins_padded * num_threads,dtype=np.float64)
        double[:] Z = np.zeros(nbins)
        int i,j,b, bb, tid

    with nogil, parallel(num_threads=num_threads):
        tid = openmp.omp_get_thread_num()
        for i in prange(N,schedule='static',chunksize=1):
            for j in range(i):
                #some pairwise quantities
                Xij = X[i]-X[j]
                Yij = 0.5*(X[i]+X[j])
                #check if in bin
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z_local[tid * nbins_padded + b] += Xij*Yij
    for tid in range(num_threads):
        for bb in range(nbins):
            Z[bb] += Z_local[tid * nbins_padded + bb]


    return np.asarray(Z)

4- 720 ms/191 ms, 3,6. -. .

+4

, Z .

, num_threads Z, cdef double[:] Z = np.zeros((num_threads, nbins), dtype=np.float64), 0 prange.

return np.sum(Z, axis=0)

Cython- with gil , . C, , OpenMP, .

+1

Source: https://habr.com/ru/post/1669973/


All Articles