Preamble: Suppose I am using an NVIDIA GTX 480 card in CUDA. The theoretical peak global memory bandwidth for this card is 177.4 GB/s: (384 / 8) * 2 * 1848e6 = 177.4e9 bytes/s = 177.4 GB/s
384 is the width of the memory interface in bits, 2 comes from the double-data-rate nature of the memory, 1848 is the memory clock frequency (in MHz), and the division by 8 converts bits to bytes.
Something similar can be calculated for shared memory: 4 bytes per bank * 32 banks * 0.5 accesses per cycle * 1400 MHz * 15 SMs = 1344 GB/s
That number already factors in the number of SMs, which is 15. So to achieve this maximum shared-memory throughput, I need all 15 SMs to be reading shared memory.
MY QUESTION: To achieve the maximum throughput of global memory, is it enough for a single SM to read from global memory, or do all SMs need to read from global memory at the same time? In particular, imagine that I run a kernel with a single block of 32 threads. Then, if I have a single warp on SM-0, and all the kernel does is read non-stop from global memory in a fully coalesced pattern, will I reach 177.4 GB/s? Or do I need to launch at least 15 blocks of 32 threads each, so that 15 warps on SM-0 through SM-14 read at the same time?
The obvious next step would be to run a benchmark to find out. I would, however, like to understand why one answer or the other is the case.
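For what it's worth, a benchmark along these lines could look like the sketch below (all names and parameters are my own invention, and the timing is deliberately minimal — no warm-up runs, no averaging). It launches the same streaming-read kernel with 1 block, 15 blocks, and many blocks, so the achieved bandwidth can be compared directly:

```cuda
/* Hypothetical microbenchmark sketch: measure achieved global-memory read
 * bandwidth as a function of the number of blocks (and hence of busy SMs).
 * Assumes a Fermi-era device such as the GTX 480. */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void stream_read(const float4 *in, float4 *out, size_t n)
{
    /* Grid-stride loop of coalesced 16-byte loads. */
    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        float4 v = in[i];
        acc.x += v.x; acc.y += v.y; acc.z += v.z; acc.w += v.w;
    }
    /* Write one result per block so the compiler cannot drop the loads. */
    if (threadIdx.x == 0) out[blockIdx.x] = acc;
}

static float time_kernel(const float4 *in, float4 *out, size_t n,
                         int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    stream_read<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    const size_t n = (size_t)1 << 26;        /* 64 Mi float4 = 1 GiB read */
    const int block_counts[] = { 1, 15, 120 };
    float4 *in, *out;

    cudaMalloc(&in, n * sizeof(float4));
    cudaMalloc(&out, 1024 * sizeof(float4));
    cudaMemset(in, 0, n * sizeof(float4));

    for (int k = 0; k < 3; ++k) {
        /* 32 threads = one warp per block, as in the question. */
        float ms = time_kernel(in, out, n, block_counts[k], 32);
        double gbps = (double)(n * sizeof(float4)) / (ms * 1e-3) / 1e9;
        printf("%4d blocks: %7.2f GB/s\n", block_counts[k], gbps);
    }

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The 120-block run is there to show the effect of oversubscription: with more resident warps per SM, memory latency can be hidden, which matters as much as the raw number of SMs.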