CUDA Register Limit: __launch_bounds__ vs maxrregcount

From the NVIDIA CUDA C Programming Guide :

Registry usage can be controlled using the maxrregcountoption compiler or launch, as described in launch rockets.

From my understanding (and correct me if I'm wrong), while -maxrregcountlimiting the number of registers that can be used the entire file .cu, the determinant __launch_bounds__determines maxThreadsPerBlockand minBlocksPerMultiprocessorfor each core __global__. These two perform the same task, but in two different ways.

My use requires that I have 40registers for each thread in order to maximize performance. That way I can use -maxrregcount 40. I can also force registers to 40use __launch_bounds__(256, 6), but this causes spills in load and storage.

What is the difference between the two in order to cause these register spills?

+4
source share
1 answer

The preface to this question is that, quoting CUDA C Programming Guide,

the fewer registers the kernel uses, the more threads and thread blocks are likely to be on the multiprocessor, which can improve performance.

__launch_bounds__ maxregcount .

__launch_bounds__

nvcc , __global__, . -, " " . , () , .

#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP     2

__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
fooKernel(int *inArr, int *outArr)
{
    // ... Computation of kernel
}

, nvcc "" .

MAX_THREADS_PER_BLOCK , MIN_BLOCKS_PER_MP . , ​​ , MAX_THREADS_PER_BLOCK, .

Programming Guide :

, L , ​​ , minBlocksPerMultiprocessor ( minBlocksPerMultiprocessor ) maxThreadsPerBlock . :

  • , L, , L, / ;

, __launch_bounds__ .

maxrregcount

maxrregcount - , , , __launch_bounds__, . , , DRAM. DRAM, L1, L2.

+7

Source: https://habr.com/ru/post/1679901/


All Articles