CUDA Templated Kernel with Dynamic Shared Memory

I want to call different instantiations of a templated CUDA kernel with dynamically allocated shared memory in one program. My first naive approach was to write:

    template <typename T>
    __global__ void kernel(T* ptr)
    {
        extern __shared__ T smem[];
        // calculations here ...
    }

    template <typename T>
    void call_kernel(T* ptr, const int n)
    {
        dim3 dimBlock(n), dimGrid;
        kernel<<<dimGrid, dimBlock, n * sizeof(T)>>>(ptr);
    }

    int main(int argc, char* argv[])
    {
        const int n = 32;
        float*  float_ptr;
        double* double_ptr;
        cudaMalloc((void**)&float_ptr,  n * sizeof(float));
        cudaMalloc((void**)&double_ptr, n * sizeof(double));

        call_kernel(float_ptr, n);
        call_kernel(double_ptr, n); // problem: 2nd instantiation

        cudaFree((void*)float_ptr);
        cudaFree((void*)double_ptr);
        return 0;
    }

However, this code does not compile. nvcc gives me the following error message:

    main.cu(4): error: declaration is incompatible with previous "smem"
    (4): here
        detected during:
            instantiation of "void kernel(T *) [with T=double]"
    (12): here
            instantiation of "void call_kernel(T *, int) [with T=double]"
    (24): here

I understand that I ran into a name conflict because shared memory is declared as extern. However, there is no way around this if I want to determine its size at runtime, as far as I know.

So my question is: Is there an elegant way to get the desired behavior? With elegant, I mean no code duplication, etc.

+6
2 answers

Dynamically allocated shared memory is really just a size (in bytes) plus a pointer made available to the kernel. So something like this should work:

replace this:

 extern __shared__ T smem[]; 

with this:

    extern __shared__ __align__(sizeof(T)) unsigned char my_smem[];
    T *smem = reinterpret_cast<T *>(my_smem);

In the programming guide, you can see other examples of re-casting dynamically allocated shared memory pointers, which may serve other needs.
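For completeness, here is a minimal sketch of the fix applied to the question's example; the copy-in/copy-out kernel body is my own illustration, not part of the original answer:

```cuda
template <typename T>
__global__ void kernel(T* ptr)
{
    // A single untyped extern array is declared for every instantiation...
    extern __shared__ __align__(sizeof(T)) unsigned char my_smem[];
    // ...and each instantiation views it as its own element type.
    T* smem = reinterpret_cast<T*>(my_smem);

    smem[threadIdx.x] = ptr[threadIdx.x];
    __syncthreads();
    ptr[threadIdx.x] = smem[threadIdx.x];
}

template <typename T>
void call_kernel(T* ptr, const int n)
{
    dim3 dimBlock(n), dimGrid;
    kernel<<<dimGrid, dimBlock, n * sizeof(T)>>>(ptr);
}
```

With this change, `call_kernel(float_ptr, n)` and `call_kernel(double_ptr, n)` can both be instantiated in the same translation unit, since the extern array now has the same type (`unsigned char`) in every instantiation.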

EDIT: Updated my answer to reflect @njuffa's comment.

+13

(A variant of @RobertCrovella's answer)

NVCC refuses to accept two extern __shared__ arrays with the same name but different types, even though they are never in scope at the same time. So we need to convince NVCC that all our template instantiations use the same type for the shared memory under the hood, while letting the kernel code that uses it see whatever type it wants.

Therefore, we will replace this instruction:

 extern __shared__ T smem[]; 

with this:

 auto smem = shared_memory_proxy<T>(); 

Where:

    template <typename T>
    __device__ T* shared_memory_proxy()
    {
        // do we need an __align__() here? I don't think so...
        extern __shared__ unsigned char memory[];
        return reinterpret_cast<T*>(memory);
    }

lives in some device-side include file.

Benefits:

  • A single line at the point of use.
  • Simple syntax to remember.
  • Separation of concerns: whoever reads the kernel doesn't have to think about why they see extern, alignment specifiers, reinterpret_cast, etc.

edit: this is implemented as part of my header-only library of CUDA kernel development tools: shared_memory.cuh (where it is called shared_memory::dynamic::proxy() ).

+2

Source: https://habr.com/ru/post/979939/

