Suppose you have a kernel that computes the three values. Each thread in your configuration computes the three values for one (r, c) pair.
__global__ void value_kernel(int Y, int H, int X, int W)
{
    int r = blockIdx.x + Y;
    int c = threadIdx.x + W;
    chan1value = ...
    chan2value = ...
    chan3value = ...
}
I do not think you can compute the sum in that kernel, at least not fully in parallel. You cannot use += the way you have above, because many threads would update the same variable at once. You could put everything into one kernel if a single thread in each block (row) does the sum and the mean, for example:
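__global__ void both_kernel(int Y, int H, int X, int W)
{
    // float is assumed; launch with 3 * blockDim.x * sizeof(float) bytes of dynamic shared memory
    extern __shared__ float vals[];
    float *ch1 = vals, *ch2 = ch1 + blockDim.x, *ch3 = ch2 + blockDim.x;

    int r = blockIdx.x + Y;
    int c = threadIdx.x + W;
    chan1value = ...
    chan2value = ...
    chan3value = ...

    // publish this thread's values so thread 0 can read them
    ch1[threadIdx.x] = chan1value;
    ch2[threadIdx.x] = chan2value;
    ch3[threadIdx.x] = chan3value;
    __syncthreads();

    if (threadIdx.x == 0) {
        ch1RowSum = 0; ch2RowSum = 0; ch3RowSum = 0;
        for (int i = 0; i < blockDim.x; i++) {
            ch1RowSum += ch1[i];
            ch2RowSum += ch2[i];
            ch3RowSum += ch3[i];
        }
        ch1Mean = ch1RowSum / blockDim.x;
        ch2Mean = ch2RowSum / blockDim.x;
        ch3Mean = ch3RowSum / blockDim.x;
    }
}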
But it is probably better to use the first kernel for the values, and then a second kernel for both the sums and the means. The kernel below can be parallelized further, and if it is separate, you can focus on that when you are ready.
__global__ void sum_kernel(int Y, int W)
{
    int r = blockIdx.x + Y;
    ch1RowSum = 0; ch2RowSum = 0; ch3RowSum = 0;
    for (int i = 0; i < W; i++) {
        // chanNvalue stands for the value that value_kernel stored for (r, i)
        ch1RowSum += chan1value;
        ch2RowSum += chan2value;
        ch3RowSum += chan3value;
    }
    ch1Mean = ch1RowSum / W;
    ch2Mean = ch2RowSum / W;
    ch3Mean = ch3RowSum / W;
}
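The remark about parallelizing the second kernel further could look roughly like the sketch below. This is not code from the question, only one possible approach under some assumptions: the values are float, one channel is stored row-major in a hypothetical buffer `vals` of rows x W elements, and the block size is a power of two. The names `row_mean_kernel` and `row_means` are made up for illustration. Each block handles one row: the threads first add up strided slices of the row, then combine their partial sums with a shared-memory tree reduction.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Assumed layout: vals holds one channel, row-major, rows x W floats.
// One block per row; blockDim.x is assumed to be a power of two.
__global__ void row_mean_kernel(const float *vals, float *means, int W)
{
    extern __shared__ float partial[];   // blockDim.x floats, sized at launch
    int r = blockIdx.x;

    // Each thread sums a strided slice of the row.
    float sum = 0.0f;
    for (int c = threadIdx.x; c < W; c += blockDim.x)
        sum += vals[r * W + c];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Thread 0 now holds the full row sum.
    if (threadIdx.x == 0)
        means[r] = partial[0] / W;
}

// Hypothetical host helper: copies the data to the device, launches one
// block per row, and returns the per-row means. CUDA error checks are
// omitted to keep the sketch short.
std::vector<float> row_means(const std::vector<float> &host_vals, int rows, int W)
{
    assert(rows > 0 && W > 0 && host_vals.size() == (size_t)rows * W);

    const int threads = 128;             // power of two, as assumed above
    float *d_vals = nullptr, *d_means = nullptr;
    cudaMalloc(&d_vals, host_vals.size() * sizeof(float));
    cudaMalloc(&d_means, rows * sizeof(float));
    cudaMemcpy(d_vals, host_vals.data(), host_vals.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    row_mean_kernel<<<rows, threads, threads * sizeof(float)>>>(d_vals, d_means, W);

    // This copy on the default stream waits for the kernel to finish.
    std::vector<float> means(rows);
    cudaMemcpy(means.data(), d_means, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_vals);
    cudaFree(d_means);
    return means;
}
```

For the three channels, the same kernel could be launched once per channel, or extended so that each thread keeps three partial sums. Which is faster likely depends on how the values are laid out in memory.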