Shared memory optimization confusion

I wrote a CUDA application that uses 1 KB of shared memory in each block. Since each SM has only 16 KB of shared memory, only 16 blocks can be resident in total (do I understand this correctly?), although only 8 can be scheduled at a time. Now, if some block is busy with a memory operation, another block will be scheduled on the SM, but all the shared memory is used by the 16 blocks that were already scheduled there. So will CUDA refuse to schedule more blocks on the same SM until the previously allocated blocks have completely finished? Or will it move some block's shared memory out to global memory and place another block there (and in that case, should we worry about the latency of global memory access)?
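
A minimal sketch of the setup described above (the kernel name and the data it copies are illustrative, not from the question); each block statically allocates 1 KB of shared memory when launched with 256 threads per block:

    __global__ void scaleKernel(const float *in, float *out, int n)
    {
        __shared__ float tile[256];            // 256 * 4 bytes = 1 KB of shared memory per block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[threadIdx.x] = in[i];         // stage the element in shared memory
            __syncthreads();
            out[i] = tile[threadIdx.x] * 2.0f; // trivial use of the staged value
        }
    }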

1 answer

That is not how it works. The number of blocks scheduled at any one time on a single SM will always be the minimum of the following (a worked numeric sketch follows the list):

  • 8 blocks (the architectural maximum number of resident blocks per SM)
  • The number of blocks whose combined static plus dynamically allocated shared memory fits within 16 KB or 48 KB, depending on the GPU architecture and settings. Shared memory also has an allocation page size, so each block's allocation is rounded up to the next multiple of that page size.
  • The number of blocks whose combined register usage fits within 8192/16384/32768 registers, depending on the architecture. The register file also has a page size, so each block's register allocation is rounded up to the next multiple of that page size.
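
A minimal host-side sketch of that "minimum of the three limits" rule, using the 1 KB-per-block figure from the question and illustrative Fermi-class constants (a real program should query them with cudaGetDeviceProperties; the page-size rounding mentioned above is ignored here):

    #include <cstdio>
    #include <algorithm>

    int main()
    {
        const int maxBlocksPerSM  = 8;          // hard per-SM scheduling limit
        const int smemPerSM       = 16 * 1024;  // 16 KB (48 KB on later architectures)
        const int regsPerSM       = 16384;      // 8192/16384/32768 depending on architecture
        const int smemPerBlock    = 1024;       // 1 KB, as in the question
        const int regsPerThread   = 16;         // illustrative value
        const int threadsPerBlock = 256;

        int bySmem = smemPerSM / smemPerBlock;                      // 16 blocks fit by shared memory
        int byRegs = regsPerSM / (regsPerThread * threadsPerBlock); // 4 blocks fit by registers
        int resident = std::min(maxBlocksPerSM, std::min(bySmem, byRegs));

        printf("Resident blocks per SM: %d\n", resident);           // 4 for these example numbers
        return 0;
    }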

That's all. There is no "paging" of shared memory out to global memory to accommodate more blocks. NVIDIA provides an occupancy calculator spreadsheet that ships with the toolkit and is also available as a separate download; you can see the exact rules in the formulas it contains. They are also discussed in Section 4.2 of the CUDA Programming Guide.
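
If your toolkit is recent enough (CUDA 6.5 or later), the same minimum can also be computed programmatically with the occupancy API rather than the spreadsheet; a minimal sketch, with an illustrative kernel that uses 1 KB of static shared memory:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative kernel: 256 floats = 1 KB of static shared memory per block.
    __global__ void smem1kbKernel(float *out)
    {
        __shared__ float tile[256];
        tile[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = tile[threadIdx.x];
    }

    int main()
    {
        int blockSize = 256;      // threads per block
        size_t dynamicSmem = 0;   // no dynamically allocated shared memory here
        int activeBlocks = 0;

        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocks, smem1kbKernel,
                                                      blockSize, dynamicSmem);

        printf("Max resident blocks per SM for this kernel: %d\n", activeBlocks);
        return 0;
    }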


Source: https://habr.com/ru/post/1347425/
