Yes, the GPU SM pipeline is similar to a CPU pipeline. The difference lies in the proportions of the front end to the back end: the GPU has one fetch/decode unit and many small ALUs (I think there are 32 parallel execute subunits), grouped as "CUDA cores" inside the SM. This is similar to a superscalar CPU (for example, a Core i7 has 6-8 issue ports, one port per independent ALU pipeline).
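To make that grouping concrete, here is a minimal CUDA sketch (kernel and buffer names are my own) that launches exactly one warp of 32 threads: the instruction stream is fetched and decoded once for the warp, and all 32 execute subunits run it in lockstep:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One warp = 32 threads. The SM fetches and decodes each instruction once,
// then all 32 execute subunits ("CUDA cores") run it in lockstep (SIMT).
__global__ void warp_lockstep(float *out, const float *in) {
    int lane = threadIdx.x;                 // 0..31: lane id within the warp
    out[lane] = in[lane] * 2.0f + 1.0f;     // one FMA, issued once per warp
}

int main() {
    const int N = 32;                       // exactly one warp
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    warp_lockstep<<<1, N>>>(d_out, d_in);   // 1 block of 32 threads = 1 warp
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("lane 5: %f\n", h_out[5]);       // expect 11.0
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```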
Here is a GTX 460 SM (image from destructoid.com; we can even see that inside each CUDA core there are two pipelines: a dispatch port, then an operand collector, then two parallel units, one for Int and the other for FP, and a result queue):
(or a better quality image http://www.legitreviews.com/images/reviews/1193/sm.jpg from http://www.legitreviews.com/article/1193/2/ )
We see that this SM has one instruction cache, two warp schedulers, 4 dispatch units, and one register file. Thus, the first stages of the SM pipeline are a shared SM resource. Once an instruction is scheduled, it is dispatched to the CUDA cores, and each core may have its own multi-stage (pipelined) ALU, especially for complex operations.
The pipeline length is hidden inside the architecture, but I assume the total pipeline depth is much greater than 4. (There are instructions with a 4-cycle latency, so the ALU pipeline is >= 4 stages, and the total SM pipeline depth is assumed to be more than 20 stages: https://devtalk.nvidia.com/default/topic/390366/instruction-latency/)
There is additional information about full instruction latency: https://devtalk.nvidia.com/default/topic/419456/how-to-schedule-warps-/ - 24-28 cycles for SP and 48-52 cycles for DP.
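Those latency numbers can be estimated with a clock()-based microbenchmark. The sketch below (iteration count and constants are arbitrary choices of mine) times a chain of dependent FMAs from a single thread; since each FMA must wait for the previous result, the elapsed cycles divided by the chain length approximates the per-instruction SP latency:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ITER 256   // arbitrary chain length; longer amortizes clock() overhead

// A chain of dependent FMAs: each iteration needs the previous result, so
// the pipeline cannot overlap them and (cycles / ITER) approximates the
// single-instruction SP latency (~24-28 cycles on Fermi per the thread above).
__global__ void dep_chain_latency(float *out, long long *cycles, float seed) {
    float x = seed;
    long long start = clock64();
    #pragma unroll
    for (int i = 0; i < ITER; ++i)
        x = x * 1.000001f + 0.000001f;      // dependent: next FMA waits for x
    long long stop = clock64();
    *out = x;                               // keep the chain from being optimized away
    *cycles = stop - start;
}

int main() {
    float *d_out;
    long long *d_cycles;
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));

    dep_chain_latency<<<1, 1>>>(d_out, d_cycles, 1.0f);  // a single thread

    long long cycles;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("~%.1f cycles per dependent FMA\n", (double)cycles / ITER);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```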
Anandtech published some slides of AMD GPUs, and we can assume that the basic ideas of pipelining should be the same for both vendors: http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4

Thus, fetch, decode, and branching are shared by all SIMD cores, and there are many ALU pipelines. In AMD, the register file is segmented between ALU groups, and in Nvidia it is shown as unified (but it may be implemented as segmented and accessed via an interconnect network).
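One visible consequence of the shared fetch/decode/branch front end is warp divergence. The hedged sketch below shows a branch where, because there is only one instruction stream per warp, the two paths are executed one after the other with the inactive lanes masked off:

```cuda
#include <cuda_runtime.h>

// Fetch/decode/branch is a per-warp resource, so when lanes of one warp take
// different branch directions the hardware runs the two paths one after the
// other, masking off the inactive lanes; this kernel pays for both branches.
__global__ void divergent(float *out) {
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = i * 2.0f;   // pass 1: even lanes active, odd lanes masked
    else
        out[i] = i * 0.5f;   // pass 2: odd lanes active, even lanes masked
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    divergent<<<1, 32>>>(d_out);   // one warp: both paths execute serially
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```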
As stated in this paper:
However, fine-grained parallelism is what sets the GPU apart. Recall that threads execute synchronously in bundles known as warps. GPUs work most efficiently when the number of warps in flight is large. Although only one warp can be issued per cycle (Fermi technically issues two half-warps per shader clock), the SM scheduler will immediately switch to another active warp when a hazard occurs. If the instruction stream generated by the CUDA compiler expresses an ILP of 3.0 (that is, on average three instructions can be issued before a hazard) and the pipeline depth is 22 stages, as few as eight active warps (22/3) may be sufficient to fully hide instruction latency and achieve peak arithmetic throughput. GPU latency hiding achieves good utilization of the GPU's enormous resources with a small burden on the programmer.
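To illustrate the arithmetic from that quote, here is a hedged sketch (constants and names are mine) of a kernel body that expresses an ILP of about 3 via three independent accumulators, with the 22/3 estimate worked out in the comment:

```cuda
#include <cuda_runtime.h>

// Three independent accumulators give the scheduler three instructions it can
// issue from the same warp before hitting a data hazard (ILP ~= 3). With a
// 22-stage pipeline, warps needed ~= ceil(pipeline_depth / ILP)
//                                  = ceil(22 / 3) = 8 active warps.
__global__ void ilp3(float *out, float seed) {
    float a = seed, b = seed + 1.0f, c = seed + 2.0f;
    for (int i = 0; i < 256; ++i) {
        // a, b, c do not depend on each other, so these three FMAs can all
        // be in flight in the ALU pipeline at the same time.
        a = a * 1.0001f + 0.1f;
        b = b * 1.0001f + 0.2f;
        c = c * 1.0001f + 0.3f;
    }
    out[threadIdx.x] = a + b + c;   // combine so nothing is optimized away
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    ilp3<<<1, 32>>>(d_out, 1.0f);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```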
Thus, only one warp at a time is issued each cycle from the pipeline front end (the SM scheduler), and there is some latency between the scheduler issuing an instruction and the ALU completing the computation.
There is part of an image from Realworldtech http://www.realworldtech.com/cayman/5/ and http://www.realworldtech.com/cayman/11/ with the Fermi pipeline. Note the [16] label in each ALU/FPU: it means there are 16 identical ALUs physically.
