CUDA: __device__ functions in separate .cu files

Two background facts: CUDA 5.0 allows you to compile CUDA code into separate object files for later linking, and on compute capability 2.x devices the compiler no longer automatically inlines every function.

As usual in C/C++, I implemented __device__ int foo() in functions.cu and put its declaration in functions.hu. foo is called from other CUDA source files.
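A minimal sketch of the layout described above (filenames from the question; the parameter list is shortened for brevity, and the nvcc command lines are assumptions based on CUDA 5.0's separate-compilation workflow):

```cuda
// functions.hu -- declaration only, included by the other .cu files
#ifndef FUNCTIONS_HU
#define FUNCTIONS_HU

__device__ int foo();

#endif

// functions.cu -- definition
#include "functions.hu"

__device__ int foo()
{
    return 1;
}

// Separate compilation and linking, e.g.:
//   nvcc -arch=sm_20 -dc functions.cu other.cu
//   nvcc -arch=sm_20 functions.o other.o -o app
```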

When I examine functions.ptx, I see that foo() spills into local memory. For testing purposes, I commented out the entire body of foo() and just had it return 1;. According to the .ptx, something still spills into local memory. (I can't imagine what it is, because the function does nothing!)

However, when I move the implementation of foo() into the functions.hu header and add the __forceinline__ qualifier, nothing is written to local memory!
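For comparison, a sketch of the header-only variant that avoided the local-memory traffic (a simplified version of the question's function, not the exact code):

```cuda
// functions.hu -- definition moved into the header and force-inlined
#ifndef FUNCTIONS_HU
#define FUNCTIONS_HU

__forceinline__ __device__ int foo()
{
    return 1; // inlined at every call site, so no ABI stack frame is needed
}

#endif
```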

What's going on here? Why doesn't CUDA inline such a simple function automatically?

The whole point of separate header and implementation files is to make the code easier to work with. But if I have to put a bunch of functions (or all of them) into the header and __forceinline__ them, that partially defeats the purpose of CUDA 5.0's separate compilation units...

Is there any way around this?


A simple, real-world example:

functions.cu:

```cuda
__device__ int foo(const uchar param0, const uchar *const param1,
                   const unsigned short int param2,
                   const unsigned short int param3, const uchar param4)
{
    return 1; // real code commented out
}
```

The above function spills into local memory.

functions.ptx:

```ptx
.visible .func (.param .b32 func_retval0) _Z45fooPKhth(
    .param .b32 _Z45foohPKhth_param_0,
    .param .b64 _Z45foohPKhth_param_1,
    .param .b32 _Z45foohPKhth_param_2,
    .param .b32 _Z45foohPKhth_param_3
)
{
    .local .align 8 .b8 __local_depot72[24];
    .reg .b64 %SP;
    .reg .b64 %SPL;
    .reg .s16 %rc<3>;
    .reg .s16 %rs<4>;
    .reg .s32 %r<2>;
    .reg .s64 %rd<2>;
```
1 answer

Not all use of local memory is a spill. Called functions must follow the ABI calling convention, which includes creating a stack frame that resides in local memory. When nvcc is passed the command-line switch -Xptxas -v, the compiler reports stack usage, with spilling as a subcomponent of it.
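To see that report, the verbose flag can be added to the compile line. A hedged sketch (the kernel and file names are illustrative, not from the question):

```cuda
// Compiling with verbose ptxas output, e.g.:
//   nvcc -arch=sm_20 -dc -Xptxas -v functions.cu kernel.cu
// makes ptxas print per-function statistics, including the stack frame
// size and the bytes spilled to/from local memory (spills are a subset
// of stack usage, which is why a trivial callee can still show local
// memory traffic).
#include "functions.hu"

__global__ void kernel(int *out)
{
    out[threadIdx.x] = foo(); // ABI call: the stack frame lives in local memory
}
```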

Currently (CUDA 5.0), the CUDA toolchain does not support function inlining across compilation-unit boundaries, as some host compilers do. Thus there is a trade-off between the flexibility of separate compilation (such as recompiling only a small part of a large project with long compile times, or the ability to build device-side libraries) and the performance gains that usually come from inlining (for example, eliminating the overhead of the ABI calling convention, which enables further optimizations such as constant propagation across function boundaries).

Function inlining within a single compilation unit is controlled by compiler heuristics that try to estimate whether inlining is likely to be profitable in terms of performance (where it is possible at all). This means that not every function gets inlined. Programmers can override the heuristics with the __forceinline__ and __noinline__ function attributes.
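A small sketch of the two override attributes mentioned above (function names are illustrative):

```cuda
// Force inlining, overriding the compiler's heuristic:
__forceinline__ __device__ int always_inlined(int x)
{
    return x + 1;
}

// Forbid inlining, e.g. to limit code size or aid debugging:
__noinline__ __device__ int never_inlined(int x)
{
    return x + 1;
}
```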


Source: https://habr.com/ru/post/1486089/

