CUDA Register Limit: __launch_bounds__ vs maxrregcount

Question

CUDA Register Limit: __launch_bounds__ vs maxrregcount

From the NVIDIA CUDA C Programming Guide :

Registry usage can be controlled using the maxrregcountoption compiler or launch, as described in launch rockets.

From my understanding (and correct me if I'm wrong), while -maxrregcountlimiting the number of registers that can be used the entire file .cu, the determinant __launch_bounds__determines maxThreadsPerBlockand minBlocksPerMultiprocessorfor each core __global__. These two perform the same task, but in two different ways.

My use requires that I have 40registers for each thread in order to maximize performance. That way I can use -maxrregcount 40. I can also force registers to 40use __launch_bounds__(256, 6), but this causes spills in load and storage.

What is the difference between the two in order to cause these register spills?

+4

gpu cuda nvcc nvidia

Kelsius Jun 22 '17 at 16:07

source share

1 answer

JackOLantern · Accepted Answer · 2017-06-22T21:16:41+0000

The preface to this question is that, quoting CUDA C Programming Guide,

the fewer registers the kernel uses, the more threads and thread blocks are likely to be on the multiprocessor, which can improve performance.

__launch_bounds__ maxregcount .

__launch_bounds__

nvcc , __global__, . -, " " . , () , .

#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP     2

__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
fooKernel(int *inArr, int *outArr)
{
    // ... Computation of kernel
}

, nvcc "" .

MAX_THREADS_PER_BLOCK , MIN_BLOCKS_PER_MP . , , MAX_THREADS_PER_BLOCK, .

Programming Guide :

, L , , minBlocksPerMultiprocessor ( minBlocksPerMultiprocessor ) maxThreadsPerBlock . :
, L, , L, / ;

, __launch_bounds__ .

maxrregcount

maxrregcount - , , , __launch_bounds__, . , , DRAM. DRAM, L1, L2.

CUDA Register Limit: __launch_bounds__ vs maxrregcount

More articles: