Preamble: Suppose I am using an NVIDIA GTX 480 card in CUDA. The theoretical peak global memory bandwidth for this card is 177.4 GB/s: (384 / 8) * 2 * 1848e6 = 177.4e9 bytes/s = 177.4 GB/s
384 is the width of the memory interface in bits, 2 comes from the double-data-rate nature of the memory, 1848 is the memory clock frequency (in MHz), and the division by 8 converts bits to bytes.
Something similar can be calculated for shared memory: 4 bytes per bank * 32 banks * 0.5 accesses per cycle * 1400 MHz * 15 SMs = 1344 GB/s
That number already factors in the number of SMs, which is 15. So to achieve this maximum shared-memory throughput, I need all 15 SMs to be reading shared memory.
MY QUESTION: To achieve the maximum throughput of global memory, is it enough for a single SM to read from global memory, or do all SMs need to read from global memory at the same time? In particular, imagine that I run a kernel with a single block of 32 threads. Then, if I have a single warp on SM-0, and all the kernel does is read non-stop from global memory in a fully coalesced pattern, will I reach 177.4 GB/s? Or do I need to launch at least 15 blocks of 32 threads each, so that 15 warps on SM-0 through SM-14 read at the same time?
The obvious next step would be to run a benchmark to find out. I would, however, like to understand why one answer or the other is the case.
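For what it's worth, a benchmark along these lines could look like the sketch below (all names and parameters are my own invention, and the timing is deliberately minimal — no warm-up runs, no averaging). It launches the same streaming-read kernel with 1 block, 15 blocks, and many blocks, so the achieved bandwidth can be compared directly:

```cuda
/* Hypothetical microbenchmark sketch: measure achieved global-memory read
 * bandwidth as a function of the number of blocks (and hence of busy SMs).
 * Assumes a Fermi-era device such as the GTX 480. */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void stream_read(const float4 *in, float4 *out, size_t n)
{
    /* Grid-stride loop of coalesced 16-byte loads. */
    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        float4 v = in[i];
        acc.x += v.x; acc.y += v.y; acc.z += v.z; acc.w += v.w;
    }
    /* Write one result per block so the compiler cannot drop the loads. */
    if (threadIdx.x == 0) out[blockIdx.x] = acc;
}

static float time_kernel(const float4 *in, float4 *out, size_t n,
                         int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    stream_read<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    const size_t n = (size_t)1 << 26;        /* 64 Mi float4 = 1 GiB read */
    const int block_counts[] = { 1, 15, 120 };
    float4 *in, *out;

    cudaMalloc(&in, n * sizeof(float4));
    cudaMalloc(&out, 1024 * sizeof(float4));
    cudaMemset(in, 0, n * sizeof(float4));

    for (int k = 0; k < 3; ++k) {
        /* 32 threads = one warp per block, as in the question. */
        float ms = time_kernel(in, out, n, block_counts[k], 32);
        double gbps = (double)(n * sizeof(float4)) / (ms * 1e-3) / 1e9;
        printf("%4d blocks: %7.2f GB/s\n", block_counts[k], gbps);
    }

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The 120-block run is there to show the effect of oversubscription: with more resident warps per SM, memory latency can be hidden, which matters as much as the raw number of SMs.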