For my current OpenCL GPGPU project, I need to sort the elements of an array according to a key with 64 possible values. I need the final array to be such that all elements with the same key are contiguous. It is sufficient to have an associative array new_index[old_index] as the output of this task.
I divided the task into two parts. First, for each possible key (bucket), I count the number of elements with that key (i.e. that fall into that bucket). Then I scan this array (compute its prefix sum), which gives the new index range for each bucket, i.e. the "start" index of each bucket.
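For the first step I currently have something like the following counting kernel in mind (just a sketch; bucket_of() and the buffer names are placeholders for whatever the real project uses, and the tiny 64-entry scan can then be done on the host or in a single small work group):

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

    #define NUM_BUCKETS 64

    // placeholder for the real key extraction
    uint bucket_of(uint key) { return key % NUM_BUCKETS; }

    // counts must be zero-initialized and have NUM_BUCKETS entries
    __kernel void countBuckets(__global const uint *keys,
                               __global uint *counts,
                               uint n)
    {
        uint i = get_global_id(0);
        if (i < n)
            atomic_inc(&counts[bucket_of(keys[i])]);
    }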
The second step is then to assign a new index to each element. If I were implementing this on a CPU, the algorithm would be something like this:
for all elements e: new_index[e] = bucket_start[bucket(e)]++
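Spelled out as a complete CPU reference (plain C, with bucket_of() again standing in for the real key function), this would be roughly:

    #include <stddef.h>

    #define NUM_BUCKETS 64

    static unsigned bucket_of(unsigned key) { return key % NUM_BUCKETS; }

    /* CPU reference: new_index[i] is the position element i moves to. */
    void bucket_permutation(const unsigned *keys, unsigned *new_index, size_t n)
    {
        unsigned count[NUM_BUCKETS] = {0};
        unsigned bucket_start[NUM_BUCKETS];
        unsigned sum = 0;

        for (size_t i = 0; i < n; ++i)            /* step 1a: histogram */
            count[bucket_of(keys[i])]++;

        for (int b = 0; b < NUM_BUCKETS; ++b) {   /* step 1b: exclusive prefix sum */
            bucket_start[b] = sum;
            sum += count[b];
        }

        for (size_t i = 0; i < n; ++i)            /* step 2: assign new indices */
            new_index[i] = bucket_start[bucket_of(keys[i])]++;
    }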
Of course, this does not work on the GPU: every element needs read-write access to the bucket_start array, which essentially means synchronization across all work items, the worst thing we could do.
An idea is to move some of the computation into work groups. But I am not sure how exactly this should be done, since I am not experienced in GPGPU computing.
In global memory, we have the bucket_start array initialized with the prefix sums as described above. Access to this array is "mutexed" with an atomic int. (I'm new to this, so maybe some more words on that would help.)
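What I mean by "mutexed" is roughly the following (a sketch using atomic_cmpxchg; the helper names lock()/unlock() are just placeholders):

    // 0 = free, 1 = taken
    void lock(volatile __global uint *mutex)
    {
        while (atomic_cmpxchg(mutex, 0u, 1u) != 0u)
            ;  // spin until we win the 0 -> 1 transition
    }

    void unlock(volatile __global uint *mutex)
    {
        atomic_xchg(mutex, 0u);  // release the lock
    }

Whether such a spinlock is safe on a given GPU is exactly what worries me in the second question below.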
Each work group is implicitly assigned a part of the input element array. It uses a local bucket array holding the new indices relative to the (global) bucket starts, which we do not yet know. Once one of these "local buffers" is full, the work group has to write the local buffers into the global array. To do this, it locks access to the global bucket_start array, increments those values by the current local bucket sizes, unlocks, and can then write the results into the global new_index array (by adding the corresponding offsets). This process is repeated until all assigned elements are processed.
Two questions arise:
Is this a good approach? I know that reading from and writing to global memory is most likely the bottleneck here, especially since I am trying to acquire synchronized access to (at least a small fraction of) global memory. But maybe there is a much better approach, perhaps using kernel decomposition. Note that I am trying to avoid reading data back from the GPU to the CPU between kernels (to avoid stalling the OpenCL command queue, which is also bad, as I was taught).
In the algorithm above, how do I implement the locking mechanism? Will something like the following code work? In particular, I expect problems when the hardware executes work items "truly parallel" in SIMD groups, such as Nvidia "warps". In my current code, all members of a work group would try to acquire the lock. Should I restrict this to the first work item only, and use barriers for local synchronization? (A sketch of what I mean follows after the code below.)
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void putInBuckets(__global uint *mutex,
                           __global uint *bucket_start,
                           __global uint *new_index)
{
    __local uint bucket_size[NUM_BUCKETS];
    __local uint bucket[NUM_BUCKETS][LOCAL_MAX_BUCKET_SIZE]; // local "new_index"
    __local uint l_bucket_start[NUM_BUCKETS];

    while (...) {
        // process a couple of elements locally until a local bucket is full
        ...

        // "lock"
        while (atomic_xchg(mutex, 1)) { }

        // "critical section"
        for (int b = 0; b < NUM_BUCKETS; ++b) {
            l_bucket_start[b] = bucket_start[b]; // where should we write?
            bucket_start[b] += bucket_size[b];   // update global offset
        }

        // "unlock"
        atomic_xchg(mutex, 0);

        // write to global memory by adding the offset
        for (...)
            new_index[...] = ... + l_bucket_start[b];
    }
}
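For illustration, this is roughly how I imagine the "first work item only" variant of the critical section (it would replace the lock/critical-section/unlock part inside the while loop above; names refer to that kernel; just a sketch, not tested):

    // only work item 0 of the group takes the lock and reserves the ranges
    barrier(CLK_LOCAL_MEM_FENCE);                 // make sure bucket_size[] is complete

    if (get_local_id(0) == 0) {
        while (atomic_xchg(mutex, 1u) != 0u)      // "lock"
            ;
        for (int b = 0; b < NUM_BUCKETS; ++b) {
            l_bucket_start[b] = bucket_start[b];  // where this group writes
            bucket_start[b]  += bucket_size[b];   // reserve the range globally
        }
        atomic_xchg(mutex, 0u);                   // "unlock"
    }

    barrier(CLK_LOCAL_MEM_FENCE);                 // l_bucket_start[] now visible to all

    // all work items can now scatter their elements using l_bucket_start[b]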