Consider these three trivial, minimal kernels. Their use in the register is much higher than I expect. Why?
A:
__global__ void Kernel_A() {
corresponds to ptx:
ptxas info : Compiling entry function '_Z8Kernel_Av' for 'sm_20' ptxas info : Function properties for _Z8Kernel_Av 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 2 registers, 32 bytes cmem[0]
IN:
template<uchar effective_bank_width> __global__ void Kernel_B() {
corresponds to ptx:
ptxas info : Compiling entry function '_Z8Kernel_BILh1EEvv' for 'sm_20' ptxas info : Function properties for _Z8Kernel_BILh1EEvv 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 2 registers, 32 bytes cmem[0]
WITH
template<uchar my_val> __global__ void Kernel_C (uchar *const device_prt_in, uchar *const device_prt_out) {
corresponds to ptx:
ptxas info : Compiling entry function '_Z35 Kernel_CILh1EEvPhS0_' for 'sm_20' ptxas info : Function properties for _Z35 Kernel_CILh1EEvPhS0_ 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 10 registers, 48 bytes cmem[0]
Question:
Why do empty kernels A and B use 2 registers? CUDA always uses one implicit register, but why are 2 additional explicit registers used?
The C kernel is even more frustrating. 10 registers? But there are only 2 pointers. This gives 2 * 2 = 4 registers for pointers. Even if there are 2 more mysterious registers (proposed by Core A and Core B), this will only give 6 points. Much less than 10!
In case you are interested, here is the ptx code for Kernel A. The ptx code for Kernel B is exactly the same, modulo integer values and variable names.
.visible .entry _Z8Kernel_Av( ) { .loc 5 19 1 func_begin0: .loc 5 19 0 .loc 5 19 1 func_exec_begin0: .loc 5 22 2 ret; tmp0: func_end0: }
And for the core C ...
.weak .entry _Z35Kernel_CILh1EEvPhS0_( .param .u64 _Z35Kernel_CILh1EEvPhS0__param_0, .param .u64 _Z35Kernel_CILh1EEvPhS0__param_1 ) { .local .align 8 .b8 __local_depot2[16]; .reg .b64 %SP; .reg .b64 %SPL; .reg .s64 %rd<3>; .loc 5 38 1 func_begin2: .loc 5 38 0 .loc 5 38 1 mov.u64 %SPL, __local_depot2; cvta.local.u64 %SP, %SPL; ld.param.u64 %rd1, [_Z35Kernel_CILh1EEvPhS0__param_0]; ld.param.u64 %rd2, [_Z35Kernel_CILh1EEvPhS0__param_1]; st.u64 [%SP+0], %rd1; st.u64 [%SP+8], %rd2; func_exec_begin2: .loc 5 836 2 tmp2: ret; tmp3: func_end2: }
- Why is a local memory variable (
.local ) declared first? - Why are two pointers (specified as function arguments) stored in registers? Is there no special space for them?
- Perhaps the two function argument pointers are case-
.reg .b64 - this explains the two lines of .reg .b64 . But what is a .reg .s64 line? Why is he there?
Worse:
D:
template<uchar my_val> __global__ void Kernel_D (uchar * device_prt_in, uchar *const device_prt_out) { device_prt_in = device_prt_in + blockIdx.x*blockDim.x + threadIdx.x; }
gives
ptxas info : Used 6 registers, 48 bytes cmem[0]
Thus, the manipulation of the argument (pointer) is reduced from 10 to 6 registers?