CUDA sorting algorithm: inside or outside the kernel?

I have a matrix of size 50000x100, and I need to sort each row using CUDA in C++. My hardware is an NVIDIA K80 card.

Since the number of columns is small, I am currently running a sorting algorithm inside the kernel: a modified bubble sort that runs over all rows of the matrix.

I am wondering if there is a more efficient way to do this. I tried using thrust::sort inside my kernel, but it is much slower. I also tried a merge sort, but the recursive part of the algorithm did not work inside my kernel.
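For reference, the in-kernel Thrust attempt presumably looked something like the sketch below (my assumption, not the asker's actual code): each thread sorts its own row using the sequential device-side execution policy thrust::seq, which is what makes thrust::sort callable from device code. The kernel name and launch shape are illustrative.

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

__global__ void sortRowsKernel(float *matIn, int nRows, int nCols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;

    // One thread sorts one row sequentially; there is no parallelism
    // inside the sort itself, which is why this tends to be slow.
    thrust::sort(thrust::seq,
                 matIn + row * nCols,
                 matIn + (row + 1) * nCols);
}

Launched, for example, as sortRowsKernel<<<(nRows + 127) / 128, 128>>>(d_mat, nRows, nCols).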

EDIT:

Here is my kernel:

__global__ void computeQuantilesKernel(float *matIn, int nRows, int nCols, int nQuantiles,
                                        float *outsideValues, float *quantilesAve, int param2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float values[100];      // big enough for 100 columns
    int keys[100];
    int nQuant[100];        // big enough for 100 quantiles (percentiles)
    float thisQuantile[100];
    int quant;

    if (idx >= nRows) return;

    // read matIn from global memory
    for (int i = 0; i < nCols; i++)
    {
        values[i] = matIn[idx * nCols + i + param2 * nCols * nRows];
        keys[i] = i;
    }

    // bubble sort:
    int i, j;
    int temp;
    float tempVal;
    for (i = 0; i < nCols - 1; i++)
    {
        for (j = 0; j < nCols - i - 1; j++)
        {
            if (values[j + 1] < values[j]) // ascending order; simply change to < for descending
            {
                tempVal = values[j];       // swap elements
                temp = keys[j];
                values[j] = values[j + 1];
                keys[j] = keys[j + 1];
                values[j + 1] = tempVal;
                keys[j + 1] = temp;
            }
        }
    }
    // end of bubble sort

    // reset nQuant and thisQuantile
    for (int iQuant = 0; iQuant < nQuantiles; iQuant++)
    {
        nQuant[iQuant] = 0;
        thisQuantile[iQuant] = 0;
    }

    // compute sum of outsideValues for each quantile
    for (int i = 0; i < nCols; i++)
    {
        quant = (int)(((float)i + 0.5) / ((float)nCols / (float)nQuantiles)); // quantile like Matlab
        nQuant[quant]++;
        thisQuantile[quant] += outsideValues[idx * nCols + keys[i]];
    }

    // divide by the size of each quantile to get averages
    for (int iQuant = 0; iQuant < nQuantiles; iQuant++)
    {
        quantilesAve[idx + nRows * iQuant + param2 * nQuantiles * nRows] =
            thisQuantile[iQuant] / (float)nQuant[iQuant];
    }
}
1 answer

Your code uses a single thread to process each of your rows separately. As a result, you are starving for fast memory (registers, L1 cache, shared memory). You allocate at least 1600 bytes per thread, and that is a lot! You want to stay at around 128 bytes per thread (32 registers of 32 bits each). Secondly, you use local arrays indexed at run time; such arrays get spilled into local memory, thrash your L1 cache and fall back to global memory (1600 B x 32 threads gives 51 KB per warp, which is already at or above the shared memory / L1 limits).

For this reason, I would suggest processing one row per block of 64 or 128 threads and keeping the row being sorted in shared memory. Bubble sort is actually quite easy to parallelize:

// Note: this assumes nCols is known at compile time (otherwise use an
// extern __shared__ array sized at launch time) and that a device-side
// swap() helper for floats is available.
__shared__ float values[nCols];
... load the data ...
__syncthreads();
for (int i = 0; i < nCols / 2; i++)
{
    int j = threadIdx.x;
    // even phase: compare-and-swap pairs (0,1), (2,3), ...
    if (j % 2 == 0 && j < nCols - 1)
        if (values[j + 1] < values[j])
            swap(values[j + 1], values[j]);
    __syncthreads();
    // odd phase: compare-and-swap pairs (1,2), (3,4), ...
    if (j % 2 == 1 && j < nCols - 1)
        if (values[j + 1] < values[j])
            swap(values[j + 1], values[j]);
    __syncthreads();
}

Note that your inner for j = ... loop is replaced by threadIdx.x, but the basic idea of the algorithm stays the same. In each iteration I first bubble-swap only the even pairs and then only the odd pairs, to avoid parallel conflicts.

I assume that nCols is lower than your block dimension, which is easy to achieve for 100 elements.
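To make the row-per-block idea concrete, here is a minimal sketch of how the fragment above could be wired into a full kernel. This is my own illustration, not the answer's code: it assumes a compile-time column count NCOLS of 100, one block per row with blockDim.x >= NCOLS, a separate output buffer, and it sorts only the values (the keys array from the question would have to be swapped alongside the values in the same way).

#define NCOLS 100   // assumed compile-time column count

__device__ inline void swapf(float &a, float &b)
{
    float t = a; a = b; b = t;
}

__global__ void sortRowsSharedKernel(const float *matIn, float *matOut, int nRows)
{
    __shared__ float values[NCOLS];

    int row = blockIdx.x;   // one row per block
    int j   = threadIdx.x;
    if (row >= nRows) return;

    // cooperative load of one row into shared memory
    if (j < NCOLS)
        values[j] = matIn[row * NCOLS + j];
    __syncthreads();

    // odd-even transposition sort: NCOLS/2 iterations of an even and an
    // odd phase, i.e. NCOLS phases in total, enough to fully sort the row
    for (int i = 0; i < NCOLS / 2; i++)
    {
        if (j % 2 == 0 && j < NCOLS - 1 && values[j + 1] < values[j])
            swapf(values[j + 1], values[j]);
        __syncthreads();
        if (j % 2 == 1 && j < NCOLS - 1 && values[j + 1] < values[j])
            swapf(values[j + 1], values[j]);
        __syncthreads();
    }

    // write the sorted row back
    if (j < NCOLS)
        matOut[row * NCOLS + j] = values[j];
}

With 50000 rows this would be launched as sortRowsSharedKernel<<<nRows, 128>>>(d_in, d_out, nRows), one block per row.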

There are many ways to improve the above code further, for example:

  • Cut the number of threads in half and use j = threadIdx.x*2 for the first half of the loop body and j = threadIdx.x*2 + 1 for the second half, so that no thread is left idle (see the sketch after this list).
  • Use only 32 threads, each handling two values of j sequentially. That way your problem fits within a single warp, allowing you to drop __syncthreads() entirely. With 32 threads you could also use warp intrinsics.
  • Experiment with #pragma unroll, although the amount of generated code may make this impractical. Profiling will tell.
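As a sketch of the first bullet (again my own illustrative fragment, reusing the values array and the swapf helper from the sketch above, and assuming blockDim.x >= NCOLS/2), each thread handles one even pair and one odd pair per iteration:

// Half as many threads: thread t handles pair (2t, 2t+1) in the even phase
// and pair (2t+1, 2t+2) in the odd phase, so no thread sits idle.
for (int i = 0; i < NCOLS / 2; i++)
{
    int j = threadIdx.x * 2;                 // even phase
    if (j < NCOLS - 1 && values[j + 1] < values[j])
        swapf(values[j + 1], values[j]);
    __syncthreads();

    j = threadIdx.x * 2 + 1;                 // odd phase
    if (j < NCOLS - 1 && values[j + 1] < values[j])
        swapf(values[j + 1], values[j]);
    __syncthreads();
}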

Also consider experimenting with a hard-coded merge sort instead of bubble sort. If my memory serves me right, when I implemented loop-unrolled bubble sort and merge sort of a similar size, merge sort was almost twice as fast as bubble sort. Note, however, that this was several years ago, on the first generation of CUDA-capable cards.


Source: https://habr.com/ru/post/1265079/

