CUDA shared memory array - odd behavior

In a CUDA kernel, I have code similar to the following. I am trying to calculate one numerator per thread and accumulate the numerators over the block to get a denominator, and then return the ratio. However, CUDA sets denom to whatever numer was calculated by the thread with the largest threadIdx.x in the block, rather than to the sum of the numer values calculated by all threads in the block. Does anyone know what is going on?

extern __shared__ float s_shared[];

float numer = //calculate numerator

s_shared[threadIdx.x] = numer;
s_shared[blockDim.x] += numer;
__syncthreads();

float denom = s_shared[blockDim.x];
float result = numer/denom;

The "result" should always be between 0 and 1 and should sum to 1 over each block. Instead, it is 1.0 for the thread with the largest threadIdx.x, and for the other threads in the block it is not confined to that range.


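You are not properly synchronizing the summation into the blockDim.x location. None of the threads waits to see what the others have written before adding its own value. It goes something like this:

  • Everyone reads the same starting value,
  • goes off and calculates that value + numer.
  • Everyone writes that value + numer back to the same memory location.

The highest threadId wins b/c it is the most likely to write last, I suppose. The += is a separate read, add, and write, so every write except the last one is lost.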

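What you want to do instead, to get a fast sum, is a binary (tree) reduction over s_shared[threadIdx.x]:

  • Everyone writes their numer, then half of the threads sum pairs of values, a quarter of the threads sum pairs of those sums, and so on, until a single thread holds the total.

This takes O(n) work and O(log n) time.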
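
A minimal sketch of a block-level tree reduction over s_shared[threadIdx.x], offered as one possible way to write it rather than as the original poster's code. The kernel name normalize_by_block_sum, the host helper run_one_block, and the input array are hypothetical, and the sketch assumes blockDim.x is a power of two that fits in a single block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread divides its value by the block-wide sum.
__global__ void normalize_by_block_sum(const float* in, float* out)
{
    extern __shared__ float s_shared[];  // blockDim.x floats, sized at launch

    const unsigned tid = threadIdx.x;
    const unsigned gid = blockIdx.x * blockDim.x + tid;

    float numer = in[gid];               // stand-in for "calculate numerator"
    s_shared[tid] = numer;
    __syncthreads();                     // all numerators are visible from here on

    // Tree reduction. Assumption: blockDim.x is a power of two.
    // At each step the active threads add the upper half into the lower half.
    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s_shared[tid] += s_shared[tid + stride];
        __syncthreads();                 // reached by every thread, active or not
    }

    float denom = s_shared[0];           // block-wide sum of all numerators
    out[gid] = numer / denom;
}

// Hypothetical host helper: runs one block of n threads (n a power of two,
// n <= 1024) and returns the sum of the normalized outputs.
// Error checking is omitted to keep the sketch short.
float run_one_block(const float* h_in, float* h_out, int n)
{
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    normalize_by_block_sum<<<1, n, n * sizeof(float)>>>(d_in, d_out);

    // Blocking copy on the default stream, so it waits for the kernel.
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (int i = 0; i < n; ++i) total += h_out[i];
    return total;
}

int main()
{
    const int n = 256;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f + i;

    float total = run_one_block(h_in, h_out, n);
    printf("sum of results = %f\n", total);  // should be close to 1
    return 0;
}
```

Compared with the code in the question, no thread ever does a read-modify-write on a location that another thread is writing in the same step, and the extra s_shared[blockDim.x] slot is no longer needed. If the block size is not a power of two, the loop would need a bounds check or padding with zeros.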


Source: https://habr.com/ru/post/1711700/

