You can feel the complexity of compiler output by compiling to annotated PTX as follows:
nvcc -ptx -Xopencc="-LIST:source=on" branching.cu
which produces a PTX assembler file with the C source code inside it as comments:
.entry _Z11branchTest0PfS_S_ ( .param .u64 __cudaparm__Z11branchTest0PfS_S__a, .param .u64 __cudaparm__Z11branchTest0PfS_S__b, .param .u64 __cudaparm__Z11branchTest0PfS_S__d) { .reg .u16 %rh<4>; .reg .u32 %r<5>; .reg .u64 %rd<10>; .reg .f32 %f<5>; .loc 16 1 0 // 1 __global__ void branchTest0(float *a, float *b, float *d) $LDWbegin__Z11branchTest0PfS_S_: .loc 16 7 0 // 3 unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x; // 4 float aval = a[tidx], bval = b[tidx]; // 5 float z0 = (aval > bval) ? aval : bval; // 6 // 7 d[tidx] = z0; mov.u16 %rh1, %ctaid.x; mov.u16 %rh2, %ntid.x; mul.wide.u16 %r1, %rh1, %rh2; cvt.u32.u16 %r2, %tid.x; add.u32 %r3, %r2, %r1; cvt.u64.u32 %rd1, %r3; mul.wide.u32 %rd2, %r3, 4; ld.param.u64 %rd3, [__cudaparm__Z11branchTest0PfS_S__a]; add.u64 %rd4, %rd3, %rd2; ld.global.f32 %f1, [%rd4+0]; ld.param.u64 %rd5, [__cudaparm__Z11branchTest0PfS_S__b]; add.u64 %rd6, %rd5, %rd2; ld.global.f32 %f2, [%rd6+0]; max.f32 %f3, %f1, %f2; ld.param.u64 %rd7, [__cudaparm__Z11branchTest0PfS_S__d]; add.u64 %rd8, %rd7, %rd2; st.global.f32 [%rd8+0], %f3; .loc 16 8 0 // 8 } exit; $LDWend__Z11branchTest0PfS_S_: } // _Z11branchTest0PfS_S_
Note that this directly tells you nothing about how to use case, since PTX uses a static single purpose, but it shows you that the assembler is set as input and how it relates to your source code. With the CUDA 4.0 toolkit, you can compile C into a cubic file for the Fermi architecture:
$ nvcc -cubin -arch=sm_20 -Xptxas="-v" branching.cu ptxas info : Compiling entry function '_Z11branchTest1PfS_S_' for 'sm_20' ptxas info : Function properties for _Z11branchTest1PfS_S_ 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
and use the cuobjdump utility to dismantle the machine code generated by the assembler.
$ cuobjdump -sass branching.cubin code for sm_20 Function : _Z11branchTest0PfS_S_ MOV R1, c [0x1] [0x100]; S2R R0, SR_CTAid_X; S2R R2, SR_Tid_X; MOV32I R5, 0x4; IMAD.U32.U32 R3, R0, c [0x0] [0x8], R2; IMUL.U32.U32.HI R4, R3, 0x4; IMAD.U32.U32 R6.CC, R3, R5, c [0x0] [0x20]; IADD.X R7, R4, c [0x0] [0x24]; IMAD.U32.U32 R8.CC, R3, R5, c [0x0] [0x28]; LD.E R2, [R6]; IADD.X R9, R4, c [0x0] [0x2c]; IMAD.U32.U32 R10.CC, R3, R5, c [0x0] [0x30]; LD.E R0, [R8]; IADD.X R11, R4, c [0x0] [0x34]; FMNMX R0, R2, R0, !pt; ST.E [R10], R0; EXIT; ......................................
As a rule, you can track back from assembler to PTX and get at least a rough idea of โโwhere the "greedy" sections of code are. Having said all this, register pressure management is one of the most difficult aspects of CUDA programming at the moment. If / when NVIDIA ever documents its ELF format for device code, I believe that the right code analysis tool would be a great project for someone.