Different thread groups in a CUDA kernel

I am trying to speed up a cross-correlation function written in C by using a CUDA kernel. At the moment, this is what I have:

__global__ void xcorr(cuDoubleComplex *temp1, cuDoubleComplex *temp2, cuDoubleComplex *temp3, int Nb, int binM, int Nspb)
{
   for (int k1 = 0; k1 < Nb; k1++)
   {
       int idx = blockIdx.x * blockDim.x + threadIdx.x;
       for (int j1 = 0; j1 < Nspb; j1++)
       {
           if ((j1 + idx) <(Nspb + binM))
           {
               temp3[idx + k1*(binM + 1)].x += (temp1[idx + j1 + (k1*(binM + Nspb))].x*temp2[j1 + (k1*Nspb)].x) + (temp1[idx + j1 + (k1*(binM + Nspb))].y*temp2[j1 + (k1*Nspb)].y);
               temp3[idx + k1*(binM + 1)].y += (-temp1[idx + j1 + (k1*(binM + Nspb))].x*temp2[j1 + (k1*Nspb)].y) + (temp1[idx + j1 + (k1*(binM + Nspb))].y*temp2[j1 + (k1*Nspb)].x);
           }
       }
    }
}

The result is what I expected, but it still takes about 50 seconds to run. I launch the kernel like this:

xcorr<<<1, 1000>>>(cuda_E2, cuda_A2, cuda_temp, Nb, *binM, Nspb);

I thought I should launch 6 blocks of 1000 threads instead of one, to avoid the loop over j1 (Nspb = 5000). I have tried several approaches, but I can't find a way to use two different groups of threads: the first block doing what the single block does now, and the other 5 replacing the j1 loop. Can someone show me how?

Any help would be appreciated.

1 answer

Have you tried branching on if (blockIdx.x == 0) inside the kernel and launching it with <<<6, 1000>>>?

__global__ void xcorr(...)
{
   if (blockIdx.x==0) {
       // do block zero stuff
   }
   else {
       // what the other blocks shall do
   }
}
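
To make the skeleton above concrete, here is a minimal toy sketch of the branching idea. It is not the asker's xcorr kernel: the names two_roles and run_two_roles and the buffers first and rest are hypothetical, and the sizes 1000 and 5000 are taken from the launch configuration and Nspb mentioned in the question. Block 0 takes one branch, while blocks 1 to 5 flatten their block and thread indices into a single index that stands in for the loop variable j1.

```cuda
#include <cuda_runtime.h>

// Toy kernel (hypothetical, not the asker's xcorr): block 0 and the
// remaining blocks take different branches.
__global__ void two_roles(int *first, int nFirst, int *rest, int nRest)
{
    if (blockIdx.x == 0) {
        // Block 0: one thread per element of `first`.
        int i = threadIdx.x;
        if (i < nFirst) first[i] = -1;
    } else {
        // Blocks 1..gridDim.x-1: (block, thread) is flattened into one
        // index that plays the role of the loop variable j1.
        int j1 = (blockIdx.x - 1) * blockDim.x + threadIdx.x;
        if (j1 < nRest) rest[j1] = j1;
    }
}

// Hypothetical host helper: launches <<<6, 1000>>> and copies both
// buffers back to the host. Returns 0 on success, 1 on any CUDA error.
int run_two_roles(int *hFirst, int nFirst, int *hRest, int nRest)
{
    int *dFirst = NULL, *dRest = NULL;
    if (cudaMalloc((void **)&dFirst, nFirst * sizeof(int)) != cudaSuccess)
        return 1;
    if (cudaMalloc((void **)&dRest, nRest * sizeof(int)) != cudaSuccess) {
        cudaFree(dFirst);
        return 1;
    }

    // 1 block for the first role + 5 blocks * 1000 threads for the second.
    two_roles<<<6, 1000>>>(dFirst, nFirst, dRest, nRest);

    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(hFirst, dFirst, nFirst * sizeof(int),
                         cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(hRest, dRest, nRest * sizeof(int),
                         cudaMemcpyDeviceToHost);

    cudaFree(dFirst);
    cudaFree(dRest);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling run_two_roles(a, 1000, b, 5000) with host arrays a and b of those sizes fills a with -1 and sets b[j1] = j1. Note that blocks in an ordinary launch run in no guaranteed order and cannot wait on each other, so the two branches should not depend on each other's results within the same kernel call.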


Source: https://habr.com/ru/post/1688395/