CUDA main pipeline

I am reading this article about a prediction model for the GPU. The second column, near the end, says:

Finally, note that each of the Nc cores of an SM on the GPU has a D-deep pipeline, which has the effect of executing D threads in parallel.

My question is about the D-deep pipeline. What does this pipeline look like? Is it conceptually similar to a CPU pipeline (I mean only the idea, since GPUs and CPUs are completely different architectures) with fetch, decode, execute and write-back stages?

Is there a document where this is described?

+4
2 answers

Yes, the GPU SM pipeline is similar to a CPU pipeline. The difference lies in the proportions of the pipeline's front end and back end: the GPU has a single fetch/decode stage and many small ALUs (I think 32 parallel execute sub-pipelines), grouped as "CUDA cores" inside the SM. This is similar to superscalar CPUs (for example, a Core i7 has 6-8 issue ports, one port per independent ALU pipeline).

Here is a GTX 460 SM (image from destructoid.com; we can even see that inside each CUDA core there are two pipelines: a dispatch port, then an operand collector, then two parallel units, one INT and one FP, and a result queue): GTX 460 SM

(or a better quality image http://www.legitreviews.com/images/reviews/1193/sm.jpg from http://www.legitreviews.com/article/1193/2/ )

We see that this SM has one instruction cache, two warp schedulers, 4 dispatch units and a single register file. So the first stages of the GPU SM pipeline are shared SM resources. After instructions are scheduled they are dispatched to the CUDA cores, and each core can have its own multi-stage (pipelined) ALU, especially for complex operations.

The pipeline length is hidden by the architecture, but I assume the total pipeline depth is much greater than 4. (There are instructions with a 4-cycle latency, so the ALU pipeline is >= 4 stages, and the total SM pipeline depth is assumed to be more than 20 stages: https://devtalk.nvidia.com/default/topic/390366/instruction-latency/ )

There is additional information about full instruction latencies at https://devtalk.nvidia.com/default/topic/419456/how-to-schedule-warps-/ - 24-28 clock cycles for SP and 48-52 clock cycles for DP.
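You can get a feel for these numbers yourself. Here is a minimal sketch (my own microbenchmark, not from the linked threads; the kernel name and constants are arbitrary) that times a chain of dependent multiply-adds with clock64() from a single warp, so nothing hides the latency and the cycles-per-iteration figure approximates the dependent-issue latency of the arithmetic pipeline:

```cuda
#include <cstdio>

// Sketch: estimate dependent FMA latency by timing a serial chain of
// multiply-adds with the on-chip cycle counter. One block, one warp,
// so there are no other warps to hide the latency.
__global__ void fma_latency(float *out, long long *cycles, int iters)
{
    float x = out[0];                       // start value read from memory
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        x = x * 1.000001f + 0.000001f;      // each FMA depends on the previous one
    long long stop = clock64();
    out[0] = x;                             // keep the result so the loop is not optimized away
    cycles[0] = stop - start;
}

int main(void)
{
    float *d_out;
    long long *d_cycles;
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));
    float one = 1.0f;
    cudaMemcpy(d_out, &one, sizeof(float), cudaMemcpyHostToDevice);

    const int iters = 1 << 20;
    fma_latency<<<1, 32>>>(d_out, d_cycles, iters);   // a single warp

    long long cycles;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("~%.1f cycles per dependent FMA\n", (double)cycles / iters);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```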

Anandtech posted several diagrams of AMD GPUs, and we can assume that the basic ideas of pipelining are the same for both vendors: http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4

AMD core according to Anandtech

So the fetch, decode and branch units are shared by all SIMD cores, and there are many ALU pipelines. In AMD GPUs the register file is segmented between ALU groups, while in Nvidia GPUs it is shown as a single unit (but it may be implemented as segmented and accessed via a network interconnect).

As stated in this paper:

However, fine-grained parallelism is what sets the GPU apart. Recall that threads execute synchronously in bundles known as warps. GPUs work most efficiently when the number of warps in flight is large. Although only one warp can be serviced per cycle (Fermi technically sustains two half-warps per shader clock), the SM scheduler will immediately switch to another active warp when a hazard occurs. If the instruction stream generated by the CUDA compiler has an ILP of 3.0 (that is, on average, three instructions can be executed before a hazard) and the pipeline depth is 22 stages, only eight active warps (22/3) may be sufficient to fully hide instruction latency and achieve maximum arithmetic throughput. GPU latency hiding enables good utilization of the GPU's enormous resources with little burden on the programmer.

So only one warp at a time is dispatched from the pipeline front end (the SM scheduler) each clock, and there is some latency between the scheduling of an instruction and the moment the ALUs finish the computation.
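As a quick sanity check of the arithmetic in the quote (this helper is my own sketch; the paper only gives the 22-stage / ILP 3.0 example), the minimum number of resident warps needed to cover the pipeline latency is just ceil(depth / ILP):

```cuda
#include <cstdio>
#include <cmath>

// Hypothetical helper, following the reasoning quoted above: with `depth`
// pipeline stages to cover and `ilp` independent instructions per warp,
// roughly depth / ilp warps must be in flight.
static int warps_to_hide_latency(int depth, double ilp)
{
    return (int)ceil((double)depth / ilp);
}

int main(void)
{
    // The Fermi example from the quote: 22-stage pipeline, ILP of 3.0.
    printf("%d warps\n", warps_to_hide_latency(22, 3.0));  // prints "8 warps"
    return 0;
}
```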

Here is part of an image from Realworldtech (http://www.realworldtech.com/cayman/5/ and http://www.realworldtech.com/cayman/11/) showing the Fermi pipeline. Note the [16] mark in each ALU/FPU block - it means there are 16 identical ALUs physically.

Fermi pipeline according to Realworldtech

+9

Warp-level parallelism arises in the GPU SM when several warps are available to execute. This hardware multithreading is described here.
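As a small illustration (my own sketch, not part of the linked description), you can query how many warps the hardware can keep resident per SM - the pool the scheduler switches between to hide pipeline latency - with cudaGetDeviceProperties:

```cuda
#include <cstdio>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    // Resident warps per SM = max resident threads per SM / warp size.
    printf("SMs: %d, warp size: %d, resident warps per SM: %d\n",
           prop.multiProcessorCount, prop.warpSize,
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```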

The paper is quite old and refers to a GTX 280 GPU. GPUs prior to the Fermi generation had an SM processing layout that was slightly different from the SM design in Fermi and later GPUs. The high-level effect is the same - the 32 threads in a warp run in "lockstep" - but while later SMs have at least 32 SPs (cores) per SM, GPUs before the Fermi generation had fewer cores per SM, usually 8. The consequence is that a given warp instruction is executed in stages, and each "core" or "SP" actually processes several lanes of the warp (in stages) in order to process that warp instruction. I believe (based on what I see in the paper) that this is the pipeline being referred to. In effect, each "core" in the GTX 280 has a 4-deep pipeline that processes the 4 threads of the warp assigned to it and therefore requires (at minimum) 4 clock cycles to actually complete the processing of those 4 threads. This is described here, and you can compare that description with the one given for later GPU generations, for example the cc 2.0 description given here.
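To see the warp/lane structure from the programmer's side, here is a minimal sketch (my own example; the kernel name and launch configuration are arbitrary). The comments note how many clocks the hardware needs to push one warp instruction through its cores, which is exactly the difference between the GTX 280 layout and Fermi described above:

```cuda
#include <cstdio>

// Every thread reports which warp and lane it belongs to. All 32 lanes of a
// warp execute the same instruction stream in lockstep; how many clocks the
// hardware needs to issue one warp instruction depends on the SM layout:
//   GT200 (GTX 280):   8 SPs per SM   -> 32 / 8  = 4 clocks per warp instruction
//   Fermi and later:  32+ cores per SM -> 32 / 32 = 1 clock per scheduler issue
__global__ void show_warp_layout(void)
{
    int lane = threadIdx.x % warpSize;   // position inside the warp (0..31)
    int warp = threadIdx.x / warpSize;   // warp index inside the block
    if (lane == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warp, threadIdx.x);
}

int main(void)
{
    show_warp_layout<<<1, 64>>>();       // 64 threads = 2 warps
    cudaDeviceSynchronize();
    return 0;
}
```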

And yes, for those who would argue with my use of "cores" and "SP": I agree that this is an imperfect description of how compute resources are organized in a GPU SM, but I believe it is in line with NVIDIA's marketing and training literature, and is consistent with how the terms "core" and "SP" are used in the referenced article.

+2
