Two facts: CUDA 5.0 allows you to compile CUDA code into separate object files for later linking, and on compute capability 2.x devices the compiler no longer automatically inlines every __device__ function.
As usual in C/C++, I implemented __device__ int foo() in functions.cu and put its declaration in functions.hu. foo() is called from other CUDA source files.
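For concreteness, the layout looks roughly like this (file contents simplified; the real foo() takes several parameters, and the caller file name is made up):

```cuda
// functions.hu -- declaration only
#ifndef FUNCTIONS_HU
#define FUNCTIONS_HU
__device__ int foo();
#endif

// functions.cu -- definition, in its own compilation unit
#include "functions.hu"
__device__ int foo() { return 1; }

// kernel.cu -- a caller in another compilation unit
#include "functions.hu"
__global__ void kernel(int *out) { *out = foo(); }
```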
When I examine functions.ptx, I see that foo() spills into local memory. For testing purposes, I commented out the entire body of foo() and just made it return 1;. Something still spills into local memory according to the .ptx. (I can't imagine what it is, because the function does nothing!)
However, when I move the implementation of foo() into the functions.hu header file and add the __forceinline__ qualifier, nothing is written to local memory!
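In other words, the spill disappears with this arrangement (again simplified):

```cuda
// functions.hu -- definition moved into the header and force-inlined
#ifndef FUNCTIONS_HU
#define FUNCTIONS_HU
__device__ __forceinline__ int foo()
{
    return 1;  // inlined into each caller; no .local depot shows up in the PTX
}
#endif
```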
What's going on here? Why doesn't CUDA inline such a trivial function automatically?
The whole point of separate header and implementation files is to make the code easier to work with. But if I have to put a bunch of functions (or all of them) in the header and __forceinline__ them, that partially defeats the purpose of CUDA 5.0's separate compilation units...
Is there any way around this?
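In case it matters, I'm building with separate compilation roughly like this (the exact flags are my assumption of a typical CUDA 5.0 setup targeting a compute capability 2.x device):

```shell
# Compile each .cu to a relocatable object (-dc enables separate compilation)
nvcc -arch=sm_20 -dc functions.cu -o functions.o
nvcc -arch=sm_20 -dc kernel.cu -o kernel.o

# Device-link the objects and produce the final binary
nvcc -arch=sm_20 functions.o kernel.o -o app

# Emit the PTX where I see the spill
nvcc -arch=sm_20 -ptx functions.cu -o functions.ptx
```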
A simple, real-world example:
functions.cu:
__device__ int foo (const uchar param0, const uchar *const param1, const unsigned short int param2, const unsigned short int param3, const uchar param4) {
    return 1;
}
The above function spills into local memory.
functions.ptx:
.visible .func (.param .b32 func_retval0) _Z45foohPKhth(
    .param .b32 _Z45foohPKhth_param_0,
    .param .b64 _Z45foohPKhth_param_1,
    .param .b32 _Z45foohPKhth_param_2,
    .param .b32 _Z45foohPKhth_param_3
)
{
    .local .align 8 .b8 __local_depot72[24];
    .reg .b64 %SP;
    .reg .b64 %SPL;
    .reg .s16 %rc<3>;
    .reg .s16 %rs<4>;
    .reg .s32 %r<2>;
    .reg .s64 %rd<2>;