I want to call different instantiations of a templated CUDA kernel, each using dynamically allocated shared memory, from one program. My first naive approach was to write:
template<typename T> __global__ void kernel(T* ptr) {
    extern __shared__ T smem[];
    // ...
}
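For reference, here is a fuller minimal reproduction; the `call_kernel` helper and `main` are my reconstruction of the launch path that the error message below refers to:

```cuda
template<typename T>
__global__ void kernel(T* ptr) {
    // Each instantiation redeclares smem with a different type,
    // which nvcc rejects as incompatible declarations.
    extern __shared__ T smem[];
    smem[threadIdx.x] = ptr[threadIdx.x];
}

template<typename T>
void call_kernel(T* ptr, int n) {
    // Third launch parameter: dynamic shared memory size in bytes.
    kernel<T><<<1, n, n * sizeof(T)>>>(ptr);
}

int main() {
    float*  f = nullptr;
    double* d = nullptr;
    call_kernel(f, 32);  // instantiates kernel<float>
    call_kernel(d, 32);  // instantiates kernel<double> -> conflict
}
```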
However, this code does not compile; nvcc reports the following error:
main.cu(4): error: declaration is incompatible with previous "smem"
(4): here
    detected during:
        instantiation of "void kernel(T *) [with T=double]" (12): here
        instantiation of "void call_kernel(T *, int) [with T=double]" (24): here
I understand that I have run into a name conflict because the shared memory array is declared extern. However, as far as I know, there is no way around the extern declaration if I want to determine the array's size at runtime.
So my question is: is there an elegant way to get the desired behavior? By elegant, I mean without code duplication and the like.
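One workaround I am aware of (a sketch only, and I am not sure it counts as elegant) is to give the extern array a fixed untyped element type, so every instantiation declares it identically, and then cast inside the kernel:

```cuda
template<typename T>
__global__ void kernel(T* ptr) {
    // Same declaration in every instantiation, so no conflict.
    extern __shared__ unsigned char smem_raw[];
    T* smem = reinterpret_cast<T*>(smem_raw);
    smem[threadIdx.x] = ptr[threadIdx.x];
}
```

Is this cast-based approach the intended pattern, or is there something cleaner?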