Cuda - minimal example, using high case

Question

Cuda - minimal example, using high case

Consider these three trivial, minimal kernels. Their use in the register is much higher than I expect. Why?

A:

__global__ void Kernel_A() { //empty }

corresponds to ptx:

 ptxas info : Compiling entry function '_Z8Kernel_Av' for 'sm_20' ptxas info : Function properties for _Z8Kernel_Av 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 2 registers, 32 bytes cmem[0]

IN:

 template<uchar effective_bank_width> __global__ void Kernel_B() { //empty } template __global__ void Kernel_B<1>();

corresponds to ptx:

 ptxas info : Compiling entry function '_Z8Kernel_BILh1EEvv' for 'sm_20' ptxas info : Function properties for _Z8Kernel_BILh1EEvv 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 2 registers, 32 bytes cmem[0]

WITH

 template<uchar my_val> __global__ void Kernel_C (uchar *const device_prt_in, uchar *const device_prt_out) { //empty }

corresponds to ptx:

 ptxas info : Compiling entry function '_Z35 Kernel_CILh1EEvPhS0_' for 'sm_20' ptxas info : Function properties for _Z35 Kernel_CILh1EEvPhS0_ 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 10 registers, 48 bytes cmem[0]

Question:

Why do empty kernels A and B use 2 registers? CUDA always uses one implicit register, but why are 2 additional explicit registers used?

The C kernel is even more frustrating. 10 registers? But there are only 2 pointers. This gives 2 * 2 = 4 registers for pointers. Even if there are 2 more mysterious registers (proposed by Core A and Core B), this will only give 6 points. Much less than 10!

In case you are interested, here is the ptx code for Kernel A. The ptx code for Kernel B is exactly the same, modulo integer values and variable names.

 .visible .entry _Z8Kernel_Av( ) { .loc 5 19 1 func_begin0: .loc 5 19 0 .loc 5 19 1 func_exec_begin0: .loc 5 22 2 ret; tmp0: func_end0: }

And for the core C ...

 .weak .entry _Z35Kernel_CILh1EEvPhS0_( .param .u64 _Z35Kernel_CILh1EEvPhS0__param_0, .param .u64 _Z35Kernel_CILh1EEvPhS0__param_1 ) { .local .align 8 .b8 __local_depot2[16]; .reg .b64 %SP; .reg .b64 %SPL; .reg .s64 %rd<3>; .loc 5 38 1 func_begin2: .loc 5 38 0 .loc 5 38 1 mov.u64 %SPL, __local_depot2; cvta.local.u64 %SP, %SPL; ld.param.u64 %rd1, [_Z35Kernel_CILh1EEvPhS0__param_0]; ld.param.u64 %rd2, [_Z35Kernel_CILh1EEvPhS0__param_1]; st.u64 [%SP+0], %rd1; st.u64 [%SP+8], %rd2; func_exec_begin2: .loc 5 836 2 tmp2: ret; tmp3: func_end2: }

Why is a local memory variable ( .local ) declared first?
Why are two pointers (specified as function arguments) stored in registers? Is there no special space for them?
Perhaps the two function argument pointers are case- .reg .b64 - this explains the two lines of .reg .b64 . But what is a .reg .s64 line? Why is he there?

Worse:

D:

 template<uchar my_val> __global__ void Kernel_D (uchar * device_prt_in, uchar *const device_prt_out) { device_prt_in = device_prt_in + blockIdx.x*blockDim.x + threadIdx.x; }

gives

 ptxas info : Used 6 registers, 48 bytes cmem[0]

Thus, the manipulation of the argument (pointer) is reduced from 10 to 6 registers?

+6

optimization assembly gpu cuda ptx

cmo Jun 20 '13 at 14:16

source share

1 answer

talonmies · Accepted Answer · 2013-06-20T16:09:58+0000

The first thing to do is that if you are worried about registers, do not look at the PTX code because it will not tell you anything. PTX uses a static unified assignment form, and the code emitted by the compiler does not include any “decoration” necessary to create an entry point into the executable machine code.

With that in mind, let's look at core A:

 $ nvcc -arch=sm_20 -m64 -cubin -Xptxas=-v null.cu ptxas info : 0 bytes gmem ptxas info : Compiling entry function '_Z8Kernel_Av' for 'sm_20' ptxas info : Function properties for _Z8Kernel_Av 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 2 registers, 32 bytes cmem[0] $ cuobjdump -sass null.cubin code for sm_20 Function : _Z8Kernel_Av /*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100]; /*0008*/ /*0x00001de780000000*/ EXIT; .............................

There are two registers. Empty kernels give no null instructions.

Other than that, I cannot reproduce what you showed. If I look at your C core as published, I get this (CUDA 5 release compiler):

 $ nvcc -arch=sm_20 -m64 -cubin -Xptxas=-v null.cu ptxas info : 0 bytes gmem ptxas info : Compiling entry function '_Z8Kernel_CILh1EEvPhS0_' for 'sm_20' ptxas info : Function properties for _Z8Kernel_CILh1EEvPhS0_ 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 2 registers, 48 bytes cmem[0] $ cuobjdump -sass null.cubin code for sm_20 Function : _Z8Kernel_CILh1EEvPhS0_ /*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100]; /*0008*/ /*0x00001de780000000*/ EXIT; ........................................

i.e. identical register code 2 for the first two cores.

and the same for D kernel:

 $ nvcc -arch=sm_20 -m64 -cubin -Xptxas=-v null.cu ptxas info : 0 bytes gmem ptxas info : Compiling entry function '_Z8Kernel_DILh1EEvPhS0_' for 'sm_20' ptxas info : Function properties for _Z8Kernel_DILh1EEvPhS0_ 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 2 registers, 48 bytes cmem[0] $ cuobjdump -sass null.cubin code for sm_20 Function : _Z8Kernel_DILh1EEvPhS0_ /*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100]; /*0008*/ /*0x00001de780000000*/ EXIT; ........................................

Again, 2 registers.

For the record, the version of nvcc that I am using is:

 $ nvcc --version nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2012 NVIDIA Corporation Built on Fri_Sep_28_16:10:16_PDT_2012 Cuda compilation tools, release 5.0, V0.2.1221

Cuda - minimal example, using high case

More articles: