Declaring Variables in a CUDA Kernel

Let's say you declare a new variable in a CUDA kernel and then use it in multiple threads, for example:

    __global__ void kernel(float* delt, float* deltb)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float a;
        a = delt[i] + deltb[i];
        a += 1;
    }

and the kernel call looks something like the following, with several threads and blocks:

    int threads = 200;
    uint3 blocks = make_uint3(200, 1, 1);
    kernel<<<blocks, threads>>>(d_delt, d_deltb);
  • Is "a" stored on the stack?
  • Is a new "a" created for each thread during initialization?
  • Or will each thread independently access "a" at an unknown time, potentially ruining the algorithm?
2 answers

None of the above. The CUDA compiler is smart enough, and aggressive enough with its optimizations, to detect that a is never used, so the entire kernel body can be optimized away. You can confirm this by compiling the kernel with -Xptxas=-v and inspecting the resource usage report, which should show essentially no register, local memory, or stack usage.
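For instance, assuming the kernel above is saved in a hypothetical file kernel.cu, a verbose build might look like the following; the output shown is only illustrative, and the exact register and constant-memory figures vary with GPU architecture and compiler version:

    nvcc -arch=sm_52 -Xptxas=-v -c kernel.cu

    ptxas info    : Compiling entry function '_Z6kernelPfS_' for 'sm_52'
    ptxas info    : Used 4 registers, 336 bytes cmem[0]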

In a less trivial example, a would most likely be stored in a per-thread register, or in per-thread local memory, which resides in off-chip DRAM.
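As a minimal sketch of such a non-trivial case: if the result is actually written back to global memory (out is a hypothetical output buffer added for illustration, not part of the original question), the compiler can no longer eliminate a, and it will normally live in a per-thread register:

    __global__ void kernel(float* delt, float* deltb, float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float a = delt[i] + deltb[i]; // per-thread scalar, normally kept in a register
        a += 1;
        out[i] = a;                   // the store keeps 'a' alive, so it cannot be removed
    }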


Any variable (scalar or array) declared inside the kernel function without the __shared__ qualifier is local to each thread: each thread gets its own copy of that variable, so no data race between threads can happen!
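A short sketch of that contrast (an illustrative example, not from the original answer): a plain local variable is private to each thread, while a __shared__ variable has one copy per block and needs explicit synchronization:

    __global__ void locality_demo(float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        float mine = (float)threadIdx.x; // private: one copy per thread

        __shared__ float shared_val;     // one copy per block, visible to all its threads
        if (threadIdx.x == 0)
            shared_val = 42.0f;          // written by a single thread...
        __syncthreads();                 // ...and synchronized before the others read it

        out[i] = mine + shared_val;
    }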

The compiler chooses whether a local variable lives in registers or in local memory (which is physically allocated in global device memory), depending on the transformations and optimizations it performs.
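One illustrative case (a sketch, with hypothetical names spill_example and idx): a scalar almost always ends up in a register, while a per-thread array indexed with a runtime value is a typical candidate for local memory, because registers cannot be indexed dynamically:

    __global__ void spill_example(const int* idx, float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        float s = 2.0f * i; // scalar: almost always ends up in a register

        float buf[32];      // per-thread array
        for (int k = 0; k < 32; ++k)
            buf[k] = s + k;

        // Runtime indexing prevents the compiler from promoting the array
        // to registers, so 'buf' is typically placed in local memory,
        // which physically lives in device DRAM.
        out[i] = buf[idx[i] & 31];
    }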

More information about which variables are placed in local memory can be found in the NVIDIA CUDA C Programming Guide, section 5.3.2.2 (Local Memory).


Source: https://habr.com/ru/post/1494139/

