Lack of acceleration and erroneous results with OpenMP and Cython

Question

Lack of acceleration and erroneous results with OpenMP and Cython

I am trying to speed up a simple piece of code written in Cython using OpenMP. This is a double loop, which for each position in the input array adds a quantity at each control point. Here is the main part of the code:

cimport cython import numpy as np cimport numpy as np cimport openmp DTYPE = np.double ctypedef np.double_t DTYPE_t cdef extern from "math.h" nogil : DTYPE_t sqrt(DTYPE_t) @cython.cdivision(True) @cython.boundscheck(False) def summation(np.ndarray[DTYPE_t,ndim=2] pos, np.ndarray[DTYPE_t,ndim=1] weights, np.ndarray[DTYPE_t, ndim=2] points, int num_threads = 0): from cython.parallel cimport prange, parallel, threadid if num_threads <= 0 : num_threads = openmp.omp_get_num_procs() if num_threads > openmp.omp_get_num_procs() : num_threads = openmp.omp_get_num_procs() openmp.omp_set_num_threads(num_threads) cdef unsigned int nips = len(points) cdef np.ndarray[DTYPE_t, ndim=1] sum_array = np.zeros(nips, dtype = np.float64) cdef np.ndarray[DTYPE_t, ndim=2] sum_array3d = np.zeros((nips,3), dtype = np.float64) cdef unsigned int n = len(weights) cdef unsigned int pi, i, id cdef double dx, dy, dz, dr, weight_i, xi,yi,zi print 'num_threads = ', openmp.omp_get_num_threads() for i in prange(n,nogil=True,schedule='static'): weight_i = weights[i] xi = pos[i,0] yi = pos[i,1] zi = pos[i,2] for pi in range(nips) : dx = points[pi,0] - xi dy = points[pi,1] - yi dz = points[pi,2] - zi dr = 1.0/sqrt(dx*dx + dy*dy + dz*dz) sum_array[pi] += weight_i * dr sum_array3d[pi,0] += weight_i * dx sum_array3d[pi,1] += weight_i * dy sum_array3d[pi,2] += weight_i * dz return sum_array, sum_array3d

I included it in the corresponding test and installation files ( https://gist.github.com/rokroskar/6ed1bfc1a5f8f9c183a6 )

There are two problems:

Firstly, in the current configuration, I cannot get acceleration. The code runs on multiple cores, but timings do not show benefits.

Secondly, the results vary depending on the number of cores indicating that there is a race condition. Shouldn't the amount on the spot decrease? Or is something funny happening because it is nested in a loop? I realized that everything inside prange is executed individually in each thread.

Both of them disappear if I change the order of the loops, but since my outer loop, how it is structured, is now where all the data is read, if I cancel it, the arrays intersect with the number num_thread, which is wasteful, I also tried to put the whole nested loop inside the with parallel(): block with parallel(): and use stream-local buffers explicitly, but could not get it to work.

It’s clear that I missed something in common about how OpenMP should work (although perhaps this is especially important for Cython?), So I would be grateful for the advice!

+6

c python multithreading openmp cython

Rok Nov 19 '13 at 15:51

source share

1 answer

Alex Rubinsteyn · Answer 1 · 2013-11-25T02:08:06+0000

Have you tried switching two loops so that you do not have to read and write multiple streams from the same places? I am sure that these loops are not automatically advertised in OpenMP abbreviations and increments such as "sum_array3d [pi, 0] + = weight_i * dx" are not atomic.

Also, since the calculation is relatively simple, Cython may be redundant, and you can get away with Parakeet or Numba .

Parakeet will use OpenMP for parallel execution of parallel actions by default. You will have to rewrite your code to look something like this:

 @parakeet.jit def summation(pos, weights, points): n_points = len(points) n_weights = len(weights) sum_array3d = np.zeros((n_points,3)) def compute(i): pxi = points[i, 0] pyi = points[i, 1] pzi = points[i, 2] total = 0.0 for j in xrange(n_weights): weight_j = weights[j] xj = pos[j,0] yj = pos[j,1] zj = pos[j,2] dx = pxi - pos[j, 0] dy = pyi - pos[j, 1] dz = pzi - pos[j, 2] dr = 1.0/np.sqrt(dx*dx + dy*dy + dz*dz) total += weight_j * dr sum_array3d[i,0] += weight_j * dx sum_array3d[i,1] += weight_j * dy sum_array3d[i,2] += weight_j * dz return total sum_array = np.array([compute(i) for i in xrange(n_points)]) return sum_array, sum_array3d

As for Numba, I'm not sure if the prange construct introduced it in the free version.

edit: A slap in the face, sorry, I missed the part of your question where you considered switching loops.

Lack of acceleration and erroneous results with OpenMP and Cython

More articles: