Your code uses a single thread to process each row separately. As a result, you are starving for fast memory (registers, L1 cache, shared memory). You allocate at least 1600 bytes per thread - that's a lot! You really want to stay around 128 bytes per thread (32 registers of 32 bits each). Secondly, you use local arrays indexed at runtime - those arrays get spilled to local memory, thrashing your L1 cache and falling back to global memory (1600 B x 32 threads gives 51 KB per warp, which is already at or above the limits of shared memory / L1).
For this reason, I would suggest processing one row per block of 64 or 128 threads and performing the sort of that row in shared memory. Bubble sort is actually quite easy to implement in parallel:
    __shared__ float values[nCols];   // nCols must be a compile-time constant here
    ... load the data ...
    __syncthreads();
    // Odd-even transposition sort: nCols/2 iterations of two phases each.
    for (int i = 0; i < nCols/2; i++)
    {
        int j = threadIdx.x;
        // Even phase: threads with even j compare/swap the pair (j, j+1).
        if (j % 2 == 0 && j < nCols-1)
            if (values[j+1] < values[j])
                swap(values[j+1], values[j]);   // swap() = simple element-exchange helper
        __syncthreads();
        // Odd phase: threads with odd j compare/swap the pair (j, j+1).
        if (j % 2 == 1 && j < nCols-1)
            if (values[j+1] < values[j])
                swap(values[j+1], values[j]);
        __syncthreads();
    }
Please note that your inner for j = ... loop is replaced by threadIdx, but the basic idea of the algorithm remains the same. In each iteration I first do the bubble swaps only on even pairs, and then only on odd pairs, to avoid parallel conflicts.
I assume that nCols is lower than your block dimension, which is easily achievable for 100 elements.
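To make this concrete, here is a minimal sketch of how the complete kernel could look, assuming a row-major nRows x NCOLS float matrix and a compile-time constant NCOLS (both names are mine, not from your code), launched with one block per row:

    #include <cuda_runtime.h>

    #define NCOLS 100   // hypothetical compile-time row width; adjust to your data

    __device__ inline void swapf(float &a, float &b)
    {
        float t = a; a = b; b = t;
    }

    // One block sorts one row; launch with blockDim.x >= NCOLS.
    __global__ void sortRows(float *data, int nRows)
    {
        __shared__ float values[NCOLS];
        int row = blockIdx.x;
        int j   = threadIdx.x;

        // Load this block's row into shared memory.
        if (j < NCOLS)
            values[j] = data[row * NCOLS + j];
        __syncthreads();

        // Odd-even transposition sort: at least NCOLS phases in total.
        for (int i = 0; i < NCOLS/2 + 1; i++)
        {
            if (j % 2 == 0 && j < NCOLS - 1 && values[j+1] < values[j])
                swapf(values[j], values[j+1]);
            __syncthreads();
            if (j % 2 == 1 && j < NCOLS - 1 && values[j+1] < values[j])
                swapf(values[j], values[j+1]);
            __syncthreads();
        }

        // Write the sorted row back to global memory.
        if (j < NCOLS)
            data[row * NCOLS + j] = values[j];
    }

A launch would then look like sortRows<<<nRows, 128>>>(d_data, nRows), with 128 threads comfortably covering the 100 columns.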
There are many ways to improve the above code, for example:
- Reduce the number of threads by half and use j = threadIdx.x*2 for the first half of the loop body and j = threadIdx.x*2+1 for the second half, so that no thread is left idle (see the sketch after this list).
- Use only 32 threads, each of which processes two j values sequentially. That way your problem fits into a single warp, letting you drop __syncthreads() entirely. With 32 threads you could also use the warp-level intrinsics.
- Experiment with #pragma unroll, although the amount of generated code may make that infeasible. Profiling will help.
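For the first suggestion, a sketch (my own, under the same assumptions and with the same swapf helper as the kernel above) of the sort loop using only NCOLS/2 threads, where each thread handles one even pair and one odd pair per iteration:

    // Same odd-even transposition sort, but with blockDim.x >= NCOLS/2 threads;
    // each thread is busy in both phases instead of idling in one of them.
    for (int i = 0; i < NCOLS/2 + 1; i++)
    {
        int j = threadIdx.x * 2;                 // even pair (j, j+1)
        if (j < NCOLS - 1 && values[j+1] < values[j])
            swapf(values[j], values[j+1]);
        __syncthreads();

        j = threadIdx.x * 2 + 1;                 // odd pair (j, j+1)
        if (j < NCOLS - 1 && values[j+1] < values[j])
            swapf(values[j], values[j+1]);
        __syncthreads();
    }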
Also consider experimenting with a hard-coded merge sort instead of bubble sort. If my memory serves me correctly, when I implemented a fixed-size bubble sort and a merge sort with unrolled loops, the merge sort was almost twice as fast as the bubble sort. Note that this was several years ago, on the first generation of CUDA-capable cards.