I wrote a CUDA application in which each block uses 1 KB of shared memory. Since each SM has only 16 KB of shared memory, does that mean at most 16 blocks can have their shared memory resident on one SM (am I understanding this correctly?), even though only 8 blocks can be scheduled on an SM at a time? Now, if some block is stalled on a memory operation, another block will be scheduled on the SM; but if all the shared memory is already in use by the blocks that are resident there, will CUDA refuse to schedule more blocks on that SM until the previously allocated blocks have completely finished? Or will it move some block's shared memory out to global memory and place another block there (and in that case, should we worry about the latency of global memory accesses)?
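For context, here is a minimal sketch of the setup I'm describing (hypothetical names, not my actual code): a kernel that statically allocates 1 KB of shared memory per block, plus a query of the device's actual limits rather than assuming 16 KB.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each block statically allocates 1 KB of shared
// memory (256 floats * 4 bytes). Assumes a launch with 256 threads/block.
__global__ void kernelWith1KBShared(const float *in, float *out, int n)
{
    __shared__ float tile[256];   // 1 KB of shared memory per block

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) tile[threadIdx.x] = in[idx];
    __syncthreads();
    if (idx < n) out[idx] = tile[threadIdx.x] * 2.0f;
}

int main()
{
    // Query the device instead of hard-coding 16 KB; on compute
    // capability 1.x hardware this reports 16384 bytes.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    // With 1 KB per block, a 16 KB shared-memory budget would allow
    // ~16 such blocks per SM, before the 8-blocks-per-SM cap applies.
    return 0;
}
```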