- Caveat: I am not a cub expert (far from it).
- You might want to review this question / answer, as I am building on the work I did there.
- Of course, if the problem size is large enough, then a device-wide sort is probably something you should consider instead (see my recommendation below).
- The test code below covers three cases: (1) a cub block sort working directly out of global memory, with both the data and the temp storage resident there; (2) the original cub block sort, which copies the data from global memory to shared memory and back; and (3) a version of the cub block sort derived from my previous answer, where there is no copying of data to / from global memory, i.e. the data is assumed to be already resident on-chip, in shared memory.
None of this has been tested extensively, but since I am building on cub building blocks and verifying the results in the first two cases, hopefully I have not made any serious mistakes. Here's the full test code, and I will make additional comments below:
```
$ cat t10.cu
#include <cub/cub.cuh>
#include <stdio.h>
#include <stdlib.h>
#include <thrust/sort.h>
#define nTPB 512
#define ELEMS_PER_THREAD 2
#define RANGE (nTPB*ELEMS_PER_THREAD)
#define DSIZE (nTPB*ELEMS_PER_THREAD)
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

using namespace cub;

// GLOBAL CUB BLOCK SORT KERNEL
// Specialize BlockRadixSort collective types
typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
__device__ int my_val[DSIZE];
__device__ typename my_block_sort::TempStorage sort_temp_stg;

// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER_THREAD integers)
__global__ void global_BlockSortKernel()
{
    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ELEMS_PER_THREAD))));
}

// ORIGINAL CUB BLOCK SORT KERNEL
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void BlockSortKernel(int *d_in, int *d_out)
{
    // Specialize BlockLoad, BlockStore, and BlockRadixSort collective types
    typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_TRANSPOSE> BlockLoadT;
    typedef cub::BlockStore<int*, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_STORE_TRANSPOSE> BlockStoreT;
    typedef cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
    // Allocate type-safe, repurposable shared memory for collectives
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockStoreT::TempStorage store;
        typename BlockRadixSortT::TempStorage sort;
    } temp_storage;
    // Obtain this block's segment of consecutive keys (blocked across threads)
    int thread_keys[ITEMS_PER_THREAD];
    int block_offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    BlockLoadT(temp_storage.load).Load(d_in + block_offset, thread_keys);
    __syncthreads(); // Barrier for smem reuse
    // Collectively sort the keys
    BlockRadixSortT(temp_storage.sort).Sort(thread_keys);
    __syncthreads(); // Barrier for smem reuse
    // Store the sorted segment
    BlockStoreT(temp_storage.store).Store(d_out + block_offset, thread_keys);
}

// SHARED MEM CUB BLOCK SORT KERNEL
// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER_THREAD integers)
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void shared_BlockSortKernel(int *d_out)
{
    __shared__ int my_val[BLOCK_THREADS*ITEMS_PER_THREAD];
    // Specialize BlockRadixSort collective types
    typedef BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD> my_block_sort;
    // Allocate shared memory for collectives
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;
    // need to extend synthetic data for ELEMS_PER_THREAD > 1
    my_val[threadIdx.x*ITEMS_PER_THREAD] = (threadIdx.x + 5); // synth data
    my_val[threadIdx.x*ITEMS_PER_THREAD+1] = (threadIdx.x + BLOCK_THREADS + 5); // synth data
    __syncthreads();
    // printf("thread %d data = %d\n", threadIdx.x, my_val[threadIdx.x*ITEMS_PER_THREAD]);
    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ITEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ITEMS_PER_THREAD))));
    __syncthreads();
    // printf("thread %d sorted data = %d\n", threadIdx.x, my_val[threadIdx.x*ITEMS_PER_THREAD]);
    if (threadIdx.x == clock()){ // dummy to prevent compiler optimization
        d_out[threadIdx.x*ITEMS_PER_THREAD] = my_val[threadIdx.x*ITEMS_PER_THREAD];
        d_out[threadIdx.x*ITEMS_PER_THREAD+1] = my_val[threadIdx.x*ITEMS_PER_THREAD+1];
    }
}

int main(){
    int *h_data, *h_result;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    h_data=(int *)malloc(DSIZE*sizeof(int));
    h_result=(int *)malloc(DSIZE*sizeof(int));
    if (h_data == 0) {printf("malloc fail\n"); return 1;}
    if (h_result == 0) {printf("malloc fail\n"); return 1;}
    for (int i = 0 ; i < DSIZE; i++) h_data[i] = rand()%RANGE;
    // first test sorting directly out of global memory
    global_BlockSortKernel<<<1,nTPB>>>(); // warm up run
    cudaDeviceSynchronize();
    cudaMemcpyToSymbol(my_val, h_data, DSIZE*sizeof(int));
    cudaCheckErrors("memcpy to symbol fail");
    cudaEventRecord(start);
    global_BlockSortKernel<<<1,nTPB>>>(); // timing run
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    cudaCheckErrors("cub 1 fail");
    cudaEventSynchronize(stop);
    float et;
    cudaEventElapsedTime(&et, start, stop);
    cudaMemcpyFromSymbol(h_result, my_val, DSIZE*sizeof(int));
    cudaCheckErrors("memcpy from symbol fail");
    if (!thrust::is_sorted(h_result, h_result+DSIZE)) { printf("sort 1 fail!\n"); return 1;}
    printf("global Elapsed time: %fms\n", et);
    printf("global Kkeys/s: %d\n", (int)(DSIZE/et));
    // now test original CUB block sort copying global to shared
    int *d_in, *d_out;
    cudaMalloc((void **)&d_in, DSIZE*sizeof(int));
    cudaMalloc((void **)&d_out, DSIZE*sizeof(int));
    cudaCheckErrors("cudaMalloc fail");
    BlockSortKernel<nTPB, ELEMS_PER_THREAD><<<1, nTPB>>>(d_in, d_out); // warm up run
    cudaMemcpy(d_in, h_data, DSIZE*sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(start);
    BlockSortKernel<nTPB, ELEMS_PER_THREAD><<<1, nTPB>>>(d_in, d_out); // timing run
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    cudaCheckErrors("cub 2 fail");
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&et, start, stop);
    cudaMemcpy(h_result, d_out, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
    cudaCheckErrors("cudaMemcpy D to H fail");
    if (!thrust::is_sorted(h_result, h_result+DSIZE)) { printf("sort 2 fail!\n"); return 1;}
    printf("CUB Elapsed time: %fms\n", et);
    printf("CUB Kkeys/s: %d\n", (int)(DSIZE/et));
    // now test shared memory-only version of block sort
    shared_BlockSortKernel<nTPB, ELEMS_PER_THREAD><<<1, nTPB>>>(d_out); // warm-up run
    cudaEventRecord(start);
    shared_BlockSortKernel<nTPB, ELEMS_PER_THREAD><<<1, nTPB>>>(d_out); // timing run
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    cudaCheckErrors("cub 3 fail");
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&et, start, stop);
    printf("shared Elapsed time: %fms\n", et);
    printf("shared Kkeys/s: %d\n", (int)(DSIZE/et));
    return 0;
}
$ nvcc -O3 -arch=sm_20 -o t10 t10.cu
$ ./t10
global Elapsed time: 0.236960ms
global Kkeys/s: 4321
CUB Elapsed time: 0.042816ms
CUB Kkeys/s: 23916
shared Elapsed time: 0.040192ms
shared Kkeys/s: 25477
$
```
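As a sanity check on the reported numbers: the Kkeys/s figures are simply DSIZE/et, and since et is in milliseconds, keys per millisecond is the same as kilo-keys per second. A standalone check against the first result:

```
#include <stdio.h>

int main(void){
    const int DSIZE = 512 * 2;    // nTPB * ELEMS_PER_THREAD = 1024 keys
    float et = 0.236960f;         // measured "global" kernel time, in ms
    // keys per millisecond == kilo-keys per second
    printf("global Kkeys/s: %d\n", (int)(DSIZE / et)); // prints 4321
    return 0;
}
```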
For this test, I am using CUDA 6.0RC, cub v1.2.0 (which is fairly recent), RHEL5.5 / gcc4.1.2, and a Quadro5000 GPU (cc2.0, 11 SMs, approximately 40% slower than a GTX480). Here are some observations that occur to me:
- The speed ratio of the original cub sort (2) to the global-memory sort (1) is approximately 6:1, which corresponds roughly to the ratio of shared memory bandwidth (~1 TB/s) to global memory bandwidth (~150 GB/s). (A quick arithmetic check follows this list.)
- The original cub sort (2) has a throughput that, when scaled by the number of SMs (11), giving 263 MKeys/s, is a sizeable fraction of the best device-wide sort I have seen on this device (thrust sort, giving ~480 MKeys/s).
- The shared-memory-only sort (3) is not much faster than the original cub sort (2), which copies input/output from/to global memory, indicating that the copy from global memory into the cub temp storage is not a significant fraction of the total processing time.
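To make the first two observations concrete, here is the arithmetic, using only the numbers from the run above (the ~1 TB/s and ~150 GB/s bandwidth figures are rough ballpark specs for this class of device, not measurements):

```
#include <stdio.h>

int main(void){
    // measured kernel times from the run above, in ms
    float t_global = 0.236960f, t_cub = 0.042816f;
    // observation 1: cub sort (2) vs. global-memory sort (1)
    printf("measured speedup : %.1f : 1\n", t_global / t_cub);       // ~5.5, i.e. roughly 6:1
    printf("bandwidth ratio  : %.1f : 1\n", 1000.0f / 150.0f);       // ~6.7
    // observation 2: one-block throughput (23916 Kkeys/s) scaled to 11 SMs
    printf("scaled throughput: %.0f MKeys/s\n", 23916 * 11 / 1000.0f); // ~263
    return 0;
}
```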
A 6:1 penalty is a large one to pay. So my recommendation would be, where possible, to use a device-wide sort on problem sizes larger than what can be handled easily by cub block sorting. This lets you leverage the expertise of some of the best GPU code developers for your sorting, and achieve throughputs much closer to what the device as a whole is capable of.
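For illustration only (this is not part of my test code): a minimal device-wide sort can be a single thrust::sort call on device data; the ~480 MKeys/s figure I quoted came from thrust's device-level sort on this GPU. The size N below is an arbitrary placeholder:

```
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <stdlib.h>

int main(){
    const int N = 1 << 22;              // placeholder "large enough" problem size
    thrust::host_vector<int> h(N);
    for (int i = 0; i < N; i++) h[i] = rand();
    thrust::device_vector<int> d = h;   // copy keys to the device
    thrust::sort(d.begin(), d.end());   // device-wide sort, uses all SMs
    thrust::host_vector<int> res = d;   // copy the sorted keys back
    return 0;
}
```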
Note that, so I could test under similar conditions, the problem size here (512 threads, 2 elements per thread) does not exceed what you can do with a CUB block sort. But it is not difficult to extend the data set to larger sizes (say, 1024 elements per thread) that can only be handled (in this context, among these choices) using the first approach. If I run problem sizes that large, on my cc2.0 GPU I observe a throughput of around 6 MKeys/s for the global-memory block sort.
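A sketch of the kind of change I mean (just the edits to t10.cu, not a standalone program): since my_val and sort_temp_stg in approach (1) are __device__ variables in global memory, no shared-memory capacity limit applies to them, so enlarging the defines is essentially all that is needed, along with dropping the launches of the other two (template) kernels from main, since at this size their __shared__ temp storage would exceed the hardware limit:

```
// in t10.cu: enlarge the per-thread workload for the global-memory
// variant (1); its data and temp storage already live in global memory
#define nTPB 512
#define ELEMS_PER_THREAD 1024   // was 2; now 512K keys in a single block
#define RANGE (nTPB*ELEMS_PER_THREAD)
#define DSIZE (nTPB*ELEMS_PER_THREAD)
```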