I am trying to speed up a simple piece of code written in Cython using OpenMP. This is a double loop, which for each position in the input array adds a quantity at each control point. Here is the main part of the code:
cimport cython import numpy as np cimport numpy as np cimport openmp DTYPE = np.double ctypedef np.double_t DTYPE_t cdef extern from "math.h" nogil : DTYPE_t sqrt(DTYPE_t) @cython.cdivision(True) @cython.boundscheck(False) def summation(np.ndarray[DTYPE_t,ndim=2] pos, np.ndarray[DTYPE_t,ndim=1] weights, np.ndarray[DTYPE_t, ndim=2] points, int num_threads = 0): from cython.parallel cimport prange, parallel, threadid if num_threads <= 0 : num_threads = openmp.omp_get_num_procs() if num_threads > openmp.omp_get_num_procs() : num_threads = openmp.omp_get_num_procs() openmp.omp_set_num_threads(num_threads) cdef unsigned int nips = len(points) cdef np.ndarray[DTYPE_t, ndim=1] sum_array = np.zeros(nips, dtype = np.float64) cdef np.ndarray[DTYPE_t, ndim=2] sum_array3d = np.zeros((nips,3), dtype = np.float64) cdef unsigned int n = len(weights) cdef unsigned int pi, i, id cdef double dx, dy, dz, dr, weight_i, xi,yi,zi print 'num_threads = ', openmp.omp_get_num_threads() for i in prange(n,nogil=True,schedule='static'): weight_i = weights[i] xi = pos[i,0] yi = pos[i,1] zi = pos[i,2] for pi in range(nips) : dx = points[pi,0] - xi dy = points[pi,1] - yi dz = points[pi,2] - zi dr = 1.0/sqrt(dx*dx + dy*dy + dz*dz) sum_array[pi] += weight_i * dr sum_array3d[pi,0] += weight_i * dx sum_array3d[pi,1] += weight_i * dy sum_array3d[pi,2] += weight_i * dz return sum_array, sum_array3d
I included it in the corresponding test and installation files ( https://gist.github.com/rokroskar/6ed1bfc1a5f8f9c183a6 )
There are two problems:
Firstly, in the current configuration, I cannot get acceleration. The code runs on multiple cores, but timings do not show benefits.
Secondly, the results vary depending on the number of cores indicating that there is a race condition. Shouldn't the amount on the spot decrease? Or is something funny happening because it is nested in a loop? I realized that everything inside prange is executed individually in each thread.
Both of them disappear if I change the order of the loops, but since my outer loop, how it is structured, is now where all the data is read, if I cancel it, the arrays intersect with the number num_thread, which is wasteful, I also tried to put the whole nested loop inside the with parallel(): block with parallel(): and use stream-local buffers explicitly, but could not get it to work.
Itβs clear that I missed something in common about how OpenMP should work (although perhaps this is especially important for Cython?), So I would be grateful for the advice!