CUDA: turning a nested for loop into a kernel

Question

I have some code that I want to turn into a CUDA kernel. Here it is:

    for (r = Y; r < Y + H; r += 2) {
        ch1RowSum = ch2RowSum = ch3RowSum = 0;
        for (c = X; c < X + W; c += 2) {
            chan1Value = // some calc'd value
            chan3Value = // some calc'd value
            chan2Value = // some calc'd value
            ch2RowSum += chan2Value;
            ch3RowSum += chan3Value;
            ch1RowSum += chan1Value;
        }
        ch1Mean += ch1RowSum / W;
        ch2Mean += ch2RowSum / W;
        ch3Mean += ch3RowSum / W;
    }

Does this need to be split into two kernels, one to compute the RowSums and one to compute the Means? And how should I handle the fact that my loop indices do not start at zero and end at N?

c++ cuda

Derek Jan 22 '11 at 22:11

1 answer

jmilloy · Accepted Answer · 2011-01-22T23:22:02+0000

Suppose you have a kernel that computes the three values. Each thread in your launch configuration computes the three values for one (r, c) pair.

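    __global__ void value_kernel(int Y, int H, int X, int W)
    {
        // one block per row, one thread per column
        // (the loops in the question step by 2; scale both indices by 2 to match)
        int r = blockIdx.x + Y;
        int c = threadIdx.x + X;   // the column offset is X, not W
        chan1value = ...
        chan2value = ...
        chan3value = ...
    }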
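To make the index mapping concrete, here is a minimal sketch of a value kernel for a single channel that handles both the (Y, X) offset and the step of 2 from the question's loops. It is one possible layout, not the only one. The names `calc_value`, `compute_values` and the output buffer `out` are hypothetical, and `calc_value` is only a stand-in for the elided "some calc'd value". The sketch assumes H and W are both positive.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the elided "some calc'd value" in the question.
__host__ __device__ inline float calc_value(int r, int c)
{
    return static_cast<float>(r + c);
}

// One block per visited row; the threads of a block stride over the visited columns,
// so a row may have more columns than the block has threads.
__global__ void value_kernel(float* out, int Y, int X, int cols)
{
    int row = blockIdx.x;                  // 0-based row index inside the region
    int r = Y + 2 * row;                   // matches r = Y, Y + 2, ...
    for (int j = threadIdx.x; j < cols; j += blockDim.x) {
        int c = X + 2 * j;                 // matches c = X, X + 2, ...
        out[static_cast<size_t>(row) * cols + j] = calc_value(r, c);
    }
}

// Hypothetical host wrapper; assumes H > 0 and W > 0, error checking omitted for brevity.
std::vector<float> compute_values(int Y, int H, int X, int W)
{
    int rows = (H + 1) / 2;                // number of r values visited with step 2
    int cols = (W + 1) / 2;                // number of c values visited with step 2
    std::vector<float> host(static_cast<size_t>(rows) * cols);
    float* dev = nullptr;
    cudaMalloc(&dev, host.size() * sizeof(float));
    value_kernel<<<rows, 128>>>(dev, Y, X, cols);
    cudaMemcpy(host.data(), dev, host.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

The block and thread indices always start at zero; the offsets Y and X are simply added inside the kernel, which is how a loop that does not run from 0 to N maps onto a launch configuration.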

I don't believe you can compute the sum (fully in parallel, at least) in the kernel above. You can't use += the way you do in your loop, because many threads would be updating the same variable at once. You can put it all into one kernel if only one thread in each block (row) does the sum and the mean, for example:

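    __global__ void both_kernel(int Y, int H, int X, int W)
    {
        // thread 0 has to read every thread's values, so stage them in shared memory
        extern __shared__ float vals[];   // 3 * blockDim.x floats
        int r = blockIdx.x + Y;
        int c = threadIdx.x + X;
        chan1value = ...
        chan2value = ...
        chan3value = ...
        vals[threadIdx.x] = chan1value;
        vals[blockDim.x + threadIdx.x] = chan2value;
        vals[2 * blockDim.x + threadIdx.x] = chan3value;
        __syncthreads();

        if (threadIdx.x == 0) {
            ch1RowSum = 0;
            ch2RowSum = 0;
            ch3RowSum = 0;
            for (int i = 0; i < blockDim.x; i++) {
                ch1RowSum += vals[i];
                ch2RowSum += vals[blockDim.x + i];
                ch3RowSum += vals[2 * blockDim.x + i];
            }
            // per-row means
            ch1Mean = ch1RowSum / blockDim.x;
            ch2Mean = ch2RowSum / blockDim.x;
            ch3Mean = ch3RowSum / blockDim.x;
        }
    }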

but it is probably better to use the first value kernel and then a second kernel for both the sums and the means. You can parallelize the kernel below further, and if it is separate you can focus on that when you are ready.

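    __global__ void sum_kernel(int Y, int W)
    {
        int r = blockIdx.x + Y;   // one block (with a single thread) per row
        ch1RowSum = 0;
        ch2RowSum = 0;
        ch3RowSum = 0;
        for (int i = 0; i < W; i++) {
            // chanNvalue is the value the first kernel stored for column i of row r
            ch1RowSum += chan1value;
            ch2RowSum += chan2value;
            ch3RowSum += chan3value;
        }
        ch1Mean = ch1RowSum / W;
        ch2Mean = ch2RowSum / W;
        ch3Mean = ch3RowSum / W;
    }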
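As an illustration of how the sum kernel could be parallelized further, the sketch below sums each row with a shared-memory tree reduction instead of a single-thread loop. This is a common technique swapped in for illustration, not something the answer spells out. It handles one channel; the names `row_sum_kernel`, `row_sums` and `kThreads`, and the row-major `values` layout (rows x cols floats), are assumptions.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

constexpr int kThreads = 256;   // assumed block size; must be a power of two

// One block per row. Each thread first accumulates a strided slice of the row,
// then the block combines the per-thread partial sums in shared memory.
__global__ void row_sum_kernel(const float* values, float* rowSum, int cols)
{
    __shared__ float partial[kThreads];
    size_t base = static_cast<size_t>(blockIdx.x) * cols;

    float acc = 0.0f;
    for (int j = threadIdx.x; j < cols; j += kThreads)
        acc += values[base + j];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (int s = kThreads / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        rowSum[blockIdx.x] = partial[0];
}

// Hypothetical host wrapper; assumes rows > 0, error checking omitted for brevity.
std::vector<float> row_sums(const std::vector<float>& values, int rows, int cols)
{
    float* dValues = nullptr;
    float* dSums = nullptr;
    cudaMalloc(&dValues, values.size() * sizeof(float));
    cudaMalloc(&dSums, rows * sizeof(float));
    cudaMemcpy(dValues, values.data(), values.size() * sizeof(float), cudaMemcpyHostToDevice);
    row_sum_kernel<<<rows, kThreads>>>(dValues, dSums, cols);
    std::vector<float> sums(rows);
    cudaMemcpy(sums.data(), dSums, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dValues);
    cudaFree(dSums);
    return sums;
}
```

With the row sums available, the means can be accumulated as in the question's loop (`ch1Mean += ch1RowSum / W`), either on the host or in another small kernel, and the same kernel can be run once per channel.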
