For my current OpenCL GPGPU project, I need to sort the elements of an array according to a key with 64 possible values. I need the final array to be such that all elements with the same key are contiguous. It is sufficient to have an associative array new_index[old_index] as the output of this task.
I divided the task into two parts. First, for each possible key (bucket), I count the number of elements with that key (i.e. that fall into that bucket). Then I scan this array (compute its prefix sum), which gives the new index range for each bucket, i.e. the "start" index of each bucket.
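For the first step I currently have something like the following counting kernel in mind (just a sketch; bucket_of() and the buffer names are placeholders for whatever the real project uses, and the tiny 64-entry scan can then be done on the host or in a single small work group):

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

    #define NUM_BUCKETS 64

    // placeholder for the real key extraction
    uint bucket_of(uint key) { return key % NUM_BUCKETS; }

    // counts must be zero-initialized and have NUM_BUCKETS entries
    __kernel void countBuckets(__global const uint *keys,
                               __global uint *counts,
                               uint n)
    {
        uint i = get_global_id(0);
        if (i < n)
            atomic_inc(&counts[bucket_of(keys[i])]);
    }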
The second step is then to assign a new index to each element. If I were implementing this on a CPU, the algorithm would be something like this:
for all elements e: new_index[e] = bucket_start[bucket(e)]++
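Spelled out as a complete CPU reference (plain C, with bucket_of() again standing in for the real key function), this would be roughly:

    #include <stddef.h>

    #define NUM_BUCKETS 64

    static unsigned bucket_of(unsigned key) { return key % NUM_BUCKETS; }

    /* CPU reference: new_index[i] is the position element i moves to. */
    void bucket_permutation(const unsigned *keys, unsigned *new_index, size_t n)
    {
        unsigned count[NUM_BUCKETS] = {0};
        unsigned bucket_start[NUM_BUCKETS];
        unsigned sum = 0;

        for (size_t i = 0; i < n; ++i)            /* step 1a: histogram */
            count[bucket_of(keys[i])]++;

        for (int b = 0; b < NUM_BUCKETS; ++b) {   /* step 1b: exclusive prefix sum */
            bucket_start[b] = sum;
            sum += count[b];
        }

        for (size_t i = 0; i < n; ++i)            /* step 2: assign new indices */
            new_index[i] = bucket_start[bucket_of(keys[i])]++;
    }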
Of course, this does not work on the GPU: every element needs read-write access to the bucket_start array, which essentially means synchronization across all work items, the worst thing we could do.
An idea is to move some of the computation into work groups. But I am not sure how exactly this should be done, since I am not experienced in GPGPU computing.
In global memory, we have the bucket_start array initialized with the prefix sums as described above. Access to this array is "mutexed" with an atomic int. (I'm new to this, so maybe some more words on that would help.)
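What I mean by "mutexed" is roughly the following (a sketch using atomic_cmpxchg; the helper names lock()/unlock() are just placeholders):

    // 0 = free, 1 = taken
    void lock(volatile __global uint *mutex)
    {
        while (atomic_cmpxchg(mutex, 0u, 1u) != 0u)
            ;  // spin until we win the 0 -> 1 transition
    }

    void unlock(volatile __global uint *mutex)
    {
        atomic_xchg(mutex, 0u);  // release the lock
    }

Whether such a spinlock is safe on a given GPU is exactly what worries me in the second question below.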
Each work group is implicitly assigned a part of the input element array. It uses a local bucket array holding the new indices relative to the (global) bucket starts, which we do not yet know. Once one of these "local buffers" is full, the work group has to write the local buffers into the global array. To do this, it locks access to the global bucket_start array, increments those values by the current local bucket sizes, unlocks, and can then write the results into the global new_index array (by adding the corresponding offsets). This process is repeated until all assigned elements are processed.
Two questions arise:
Is this a good approach? I know that reading from and writing to global memory is most likely the bottleneck here, especially since I am trying to acquire synchronized access to (at least a small fraction of) global memory. But maybe there is a much better approach, perhaps using kernel decomposition. Note that I am trying to avoid reading data back from the GPU to the CPU between kernels (to avoid stalling the OpenCL command queue, which is also bad, as I was taught).
In the algorithm above, how do I implement the locking mechanism? Will something like the following code work? In particular, I expect problems when the hardware executes work items "truly parallel" in SIMD groups, such as Nvidia "warps". In my current code, all members of a work group would try to acquire the lock. Should I restrict this to the first work item only, and use barriers for local synchronization? (A sketch of what I mean follows after the code below.)
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void putInBuckets(__global uint *mutex,
                           __global uint *bucket_start,
                           __global uint *new_index)
{
    __local uint bucket_size[NUM_BUCKETS];
    __local uint bucket[NUM_BUCKETS][LOCAL_MAX_BUCKET_SIZE]; // local "new_index"
    __local uint l_bucket_start[NUM_BUCKETS];

    while (...) {
        // process a couple of elements locally until a local bucket is full
        ...

        // "lock"
        while (atomic_xchg(mutex, 1)) { }

        // "critical section"
        for (int b = 0; b < NUM_BUCKETS; ++b) {
            l_bucket_start[b] = bucket_start[b]; // where should we write?
            bucket_start[b] += bucket_size[b];   // update global offset
        }

        // "unlock"
        atomic_xchg(mutex, 0);

        // write to global memory by adding the offset
        for (...)
            new_index[...] = ... + l_bucket_start[b];
    }
}
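For illustration, this is roughly how I imagine the "first work item only" variant of the critical section (it would replace the lock/critical-section/unlock part inside the while loop above; names refer to that kernel; just a sketch, not tested):

    // only work item 0 of the group takes the lock and reserves the ranges
    barrier(CLK_LOCAL_MEM_FENCE);                 // make sure bucket_size[] is complete

    if (get_local_id(0) == 0) {
        while (atomic_xchg(mutex, 1u) != 0u)      // "lock"
            ;
        for (int b = 0; b < NUM_BUCKETS; ++b) {
            l_bucket_start[b] = bucket_start[b];  // where this group writes
            bucket_start[b]  += bucket_size[b];   // reserve the range globally
        }
        atomic_xchg(mutex, 0u);                   // "unlock"
    }

    barrier(CLK_LOCAL_MEM_FENCE);                 // l_bucket_start[] now visible to all

    // all work items can now scatter their elements using l_bucket_start[b]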