How to determine which CUDA lines use the most registers?

I have a somewhat complex kernel with the following characteristics:

 ptxas info    : Compiling entry function 'my_kernel' for 'sm_21'
 ptxas info    : Function properties for my_kernel
     32 bytes stack frame, 64 bytes spill stores, 40 bytes spill loads
 ptxas info    : Used 62 registers, 120 bytes cmem[0], 128 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]

It is not clear to me which part of the kernel is the "high water mark" in terms of register usage. The nature of the kernel is such that stubbing out various pieces with constant values causes the optimizer to do sweeping things elsewhere, etc. (at least that is how it appears, since the register counts I get back when I do this don't make much sense).

The CUDA profiler is also unhelpful AFAICT; it just tells me that I have register pressure.

Is there a way to get more detail on register usage? I would prefer some kind of tool, but I would also be interested in hearing what it would take to dig into the compiled binary, if necessary.

Edit: it is possible for me to approach this bottom-up (e.g. make experimental code changes, check the effect on register usage, etc.), but I would rather start top-down, or at least get some guidance on where to begin a bottom-up investigation. A sketch of the kind of experiment I mean follows.
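(For illustration only, a hypothetical kernel fragment showing the "stub out a piece with a constant" experiment described above: compile once with -DSTUB_WEIGHT and once without, and compare the "Used N registers" line that ptxas -v reports for the two builds. The kernel and the STUB_WEIGHT macro are made up; they are not from the real code.)

 __global__ void my_kernel_experiment(const float *in, float *out, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i >= n) return;

 #ifdef STUB_WEIGHT
     // Constant stand-in: the optimizer may now delete the code that fed it,
     // which is why the resulting register counts can be misleading.
     float weight = 1.0f;
 #else
     float weight = expf(-in[i] * in[i]);   // the piece under investigation
 #endif

     out[i] = weight * in[i];
 }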

1 answer

You can get a feel for the complexity of the compiler output by compiling to annotated PTX, like this:

 nvcc -ptx -Xopencc="-LIST:source=on" branching.cu 

which produces a PTX assembler file with the C source code inside it as comments:

 	.entry _Z11branchTest0PfS_S_ (
 		.param .u64 __cudaparm__Z11branchTest0PfS_S__a,
 		.param .u64 __cudaparm__Z11branchTest0PfS_S__b,
 		.param .u64 __cudaparm__Z11branchTest0PfS_S__d)
 	{
 	.reg .u16 %rh<4>;
 	.reg .u32 %r<5>;
 	.reg .u64 %rd<10>;
 	.reg .f32 %f<5>;
 	.loc	16	1	0
 //   1  __global__ void branchTest0(float *a, float *b, float *d)
 $LDWbegin__Z11branchTest0PfS_S_:
 	.loc	16	7	0
 //   3  	unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
 //   4  	float aval = a[tidx], bval = b[tidx];
 //   5  	float z0 = (aval > bval) ? aval : bval;
 //   6  
 //   7  	d[tidx] = z0;
 	mov.u16 	%rh1, %ctaid.x;
 	mov.u16 	%rh2, %ntid.x;
 	mul.wide.u16 	%r1, %rh1, %rh2;
 	cvt.u32.u16 	%r2, %tid.x;
 	add.u32 	%r3, %r2, %r1;
 	cvt.u64.u32 	%rd1, %r3;
 	mul.wide.u32 	%rd2, %r3, 4;
 	ld.param.u64 	%rd3, [__cudaparm__Z11branchTest0PfS_S__a];
 	add.u64 	%rd4, %rd3, %rd2;
 	ld.global.f32 	%f1, [%rd4+0];
 	ld.param.u64 	%rd5, [__cudaparm__Z11branchTest0PfS_S__b];
 	add.u64 	%rd6, %rd5, %rd2;
 	ld.global.f32 	%f2, [%rd6+0];
 	max.f32 	%f3, %f1, %f2;
 	ld.param.u64 	%rd7, [__cudaparm__Z11branchTest0PfS_S__d];
 	add.u64 	%rd8, %rd7, %rd2;
 	st.global.f32 	[%rd8+0], %f3;
 	.loc	16	8	0
 //   8  }
 	exit;
 $LDWend__Z11branchTest0PfS_S_:
 	} // _Z11branchTest0PfS_S_
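For reference, the kernel that produced this listing can be read straight back out of the embedded source comments; a minimal branching.cu reconstructed from them (an approximation, not necessarily the verbatim original) would look like this:

 __global__ void branchTest0(float *a, float *b, float *d)
 {
     // These lines correspond to the numbered comments in the PTX above
     unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
     float aval = a[tidx], bval = b[tidx];
     float z0 = (aval > bval) ? aval : bval;

     d[tidx] = z0;
 }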

Note that this tells you nothing directly about register usage, because PTX uses static single assignment form, but it does show you the input the assembler is given and how it relates to your original source code. With the CUDA 4.0 toolkit you can then compile the code to a cubin file for the Fermi architecture:

 $ nvcc -cubin -arch=sm_20 -Xptxas="-v" branching.cu
 ptxas info    : Compiling entry function '_Z11branchTest1PfS_S_' for 'sm_20'
 ptxas info    : Function properties for _Z11branchTest1PfS_S_
     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

and then use the cuobjdump utility to disassemble the machine code that the assembler generated:

 $ cuobjdump -sass branching.cubin

 	code for sm_20
 		Function : _Z11branchTest0PfS_S_
 	/*0000*/     /*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];
 	/*0008*/     /*0x94001c042c000000*/ 	S2R R0, SR_CTAid_X;
 	/*0010*/     /*0x84009c042c000000*/ 	S2R R2, SR_Tid_X;
 	/*0018*/     /*0x10015de218000000*/ 	MOV32I R5, 0x4;
 	/*0020*/     /*0x2000dc0320044000*/ 	IMAD.U32.U32 R3, R0, c [0x0] [0x8], R2;
 	/*0028*/     /*0x10311c435000c000*/ 	IMUL.U32.U32.HI R4, R3, 0x4;
 	/*0030*/     /*0x80319c03200b8000*/ 	IMAD.U32.U32 R6.CC, R3, R5, c [0x0] [0x20];
 	/*0038*/     /*0x9041dc4348004000*/ 	IADD.X R7, R4, c [0x0] [0x24];
 	/*0040*/     /*0xa0321c03200b8000*/ 	IMAD.U32.U32 R8.CC, R3, R5, c [0x0] [0x28];
 	/*0048*/     /*0x00609c8584000000*/ 	LD.E R2, [R6];
 	/*0050*/     /*0xb0425c4348004000*/ 	IADD.X R9, R4, c [0x0] [0x2c];
 	/*0058*/     /*0xc0329c03200b8000*/ 	IMAD.U32.U32 R10.CC, R3, R5, c [0x0] [0x30];
 	/*0060*/     /*0x00801c8584000000*/ 	LD.E R0, [R8];
 	/*0068*/     /*0xd042dc4348004000*/ 	IADD.X R11, R4, c [0x0] [0x34];
 	/*0070*/     /*0x00201c00081e0000*/ 	FMNMX R0, R2, R0, !pt;
 	/*0078*/     /*0x00a01c8594000000*/ 	ST.E [R10], R0;
 	/*0080*/     /*0x00001de780000000*/ 	EXIT;
 		......................................

It is generally possible to trace back from the assembler output to the PTX and get at least a rough idea of where the "greedy" sections of code are. Having said all that, managing register pressure is one of the more difficult aspects of CUDA programming at the moment. If/when NVIDIA ever documents the ELF format it uses for device code, I think a proper code-analysis tool would be a great project for someone.
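(Not an answer to the line-level question, but for completeness: if the immediate goal is simply to cap register usage while investigating, the two standard knobs are the --maxrregcount option to nvcc and the __launch_bounds__() qualifier on the kernel. A minimal sketch; the kernel itself is made up for illustration:)

 // Whole-compilation cap:  nvcc -arch=sm_20 --maxrregcount=32 branching.cu
 // Per-kernel hint: max 256 threads per block, aim for 4 resident blocks per SM.
 // ptxas will spill to local memory if it cannot meet the implied register limit.
 __global__ void
 __launch_bounds__(256, 4)
 scale_kernel(float *data, float s, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         data[i] *= s;
 }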


Source: https://habr.com/ru/post/895446/

