I am trying to speed up a cross-correlation function written in C by running it on the GPU with CUDA. At the moment, this is what I have:
__global__ void xcorr(cuDoubleComplex *temp1, cuDoubleComplex *temp2, cuDoubleComplex *temp3,
                      int Nb, int binM, int Nspb)
{
    for (int k1 = 0; k1 < Nb; k1++)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        for (int j1 = 0; j1 < Nspb; j1++)
        {
            if ((j1 + idx) < (Nspb + binM))
            {
                temp3[idx + k1*(binM + 1)].x +=
                    (temp1[idx + j1 + (k1*(binM + Nspb))].x * temp2[j1 + (k1*Nspb)].x) +
                    (temp1[idx + j1 + (k1*(binM + Nspb))].y * temp2[j1 + (k1*Nspb)].y);
                temp3[idx + k1*(binM + 1)].y +=
                    (-temp1[idx + j1 + (k1*(binM + Nspb))].x * temp2[j1 + (k1*Nspb)].y) +
                    (temp1[idx + j1 + (k1*(binM + Nspb))].y * temp2[j1 + (k1*Nspb)].x);
            }
        }
    }
}
The result is what I expected, but it is still slow: it takes around 50 seconds to run. I call the kernel like this:
xcorr<<<1, 1000>>>(cuda_E2, cuda_A2, cuda_temp, Nb, *binM, Nspb);
I thought I would need to launch 6 blocks of 1000 threads instead of one, so that the extra threads could replace the j1 loop (Nspb = 5000). I tried several approaches, but I can't find a way to use two different groups of threads: the first block doing what it does now, and the other 5 replacing the j1 loop. Can someone show me how?
Any help would be appreciated.
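To make the question concrete, here is a minimal sketch of one possible restructuring. It does not split the j1 loop across blocks as described above. Instead it uses a 2D grid so that each thread handles one (k1, idx) pair and accumulates its sum in registers before writing once. The names xcorr2d and launch_xcorr2d and the block size of 256 are hypothetical. It assumes that the valid lags are idx = 0..binM, that temp3 is zeroed beforehand, and that Nb fits within the grid's y-dimension limit (65535).

```cuda
#include <cuComplex.h>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per (segment k1, lag idx) pair.
// Computes the same sum as the original, temp1 * conj(temp2) over j1.
__global__ void xcorr2d(const cuDoubleComplex *temp1, const cuDoubleComplex *temp2,
                        cuDoubleComplex *temp3, int Nb, int binM, int Nspb)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // lag within the segment
    int k1  = blockIdx.y;                             // segment index (replaces the k1 loop)
    if (k1 >= Nb || idx > binM) return;               // assumption: lags run from 0 to binM

    const cuDoubleComplex *a = temp1 + (size_t)k1 * (binM + Nspb) + idx;
    const cuDoubleComplex *b = temp2 + (size_t)k1 * Nspb;

    // With idx <= binM, the original bound (j1 + idx) < (Nspb + binM) always holds,
    // so no extra check is needed inside the loop.
    double re = 0.0, im = 0.0;
    for (int j1 = 0; j1 < Nspb; j1++) {
        re += a[j1].x * b[j1].x + a[j1].y * b[j1].y;
        im += a[j1].y * b[j1].x - a[j1].x * b[j1].y;
    }

    // Single global-memory update per thread; += matches the original accumulation.
    size_t o = (size_t)k1 * (binM + 1) + idx;
    temp3[o].x += re;
    temp3[o].y += im;
}

// Hypothetical host wrapper: grid.x covers the lags, grid.y covers the Nb segments.
cudaError_t launch_xcorr2d(const cuDoubleComplex *d_E2, const cuDoubleComplex *d_A2,
                           cuDoubleComplex *d_temp, int Nb, int binM, int Nspb)
{
    const int threads = 256;  // assumed block size, not tuned
    dim3 grid((binM + 1 + threads - 1) / threads, Nb);
    xcorr2d<<<grid, threads>>>(d_E2, d_A2, d_temp, Nb, binM, Nspb);
    return cudaGetLastError();
}
```

With the variables from the call above, this would be invoked as launch_xcorr2d(cuda_E2, cuda_A2, cuda_temp, Nb, *binM, Nspb). Splitting the j1 loop itself across several blocks would additionally need a reduction (for example partial sums combined with atomicAdd), which this sketch does not attempt.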