Unfortunately, the cycle-count documentation you are looking for either does not exist or, if it does, will probably not be as useful as you expect. You are right that some of the more complex GPU instructions take longer to execute than simpler ones, but cycle counts only matter when instruction execution time is the main performance bottleneck, and GPUs are designed so that this is very rarely the case.
GPU shader programs achieve such high performance by running many (potentially thousands of) shader threads in parallel. Each shader thread typically executes no more than one instruction before being swapped out for another thread. Under ideal conditions, there are enough threads in flight that some of them are always ready to execute their next instruction, so the GPU never has to stall; this hides the latency of any operation performed by a single thread. If the GPU does useful work every cycle, it is as if each shader instruction executed in a single cycle. In that case, the only way to make your program faster is to make it shorter (fewer instructions = fewer cycles of work overall).
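To make the latency-hiding argument concrete, here is a minimal toy model in Python. It is only a sketch under simplifying assumptions that are mine, not a description of any real GPU: one instruction can be issued per cycle, every instruction has the same fixed latency, and the scheduler picks whichever ready thread has waited longest. The function name `gpu_utilization` and its parameters are hypothetical.

```python
def gpu_utilization(num_threads, latency, instructions_per_thread=100):
    """Fraction of cycles in which the toy GPU issues an instruction.

    Assumptions (illustrative only): one issue slot per cycle, and a
    thread that issues at cycle c cannot issue again before c + latency.
    """
    ready_at = [0] * num_threads                      # earliest cycle each thread may issue
    remaining = [instructions_per_thread] * num_threads
    total = num_threads * instructions_per_thread
    issued = 0
    cycle = 0
    while issued < total:
        # Threads that still have work and whose previous instruction has completed.
        ready = [t for t in range(num_threads)
                 if remaining[t] > 0 and ready_at[t] <= cycle]
        if ready:
            # Issue from the thread that has been ready the longest.
            t = min(ready, key=lambda i: ready_at[i])
            remaining[t] -= 1
            ready_at[t] = cycle + latency
            issued += 1
        cycle += 1                                    # an idle cycle if nothing was ready
    return issued / cycle


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 40):
        print(n, "threads:", round(gpu_utilization(n, latency=20), 3))
```

In this model, utilization climbs as threads are added and stays at full utilization once there are at least as many threads as cycles of latency. At that point the per-instruction latency no longer affects the total run time, only the instruction count does.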
Under more realistic conditions, when there is not enough work to keep the GPU fully loaded, the bottleneck is almost guaranteed to be memory access, not the ALU. A single texture fetch can take thousands of cycles to return in the worst case; with unpredictable stalls like that, it is generally not worth worrying that sqrt() takes more cycles than dot().
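A back-of-the-envelope calculation shows the scale of the difference. All numbers below are made up for illustration (they are not measured or documented cycle counts for any GPU), and the helper `shader_cycles` is hypothetical; it models the worst case for a single thread in which no latency is hidden at all.

```python
def shader_cycles(alu_ops, cycles_per_alu_op, texture_fetches, cycles_per_fetch):
    """Worst-case cycles for one thread when no latency is hidden."""
    return alu_ops * cycles_per_alu_op + texture_fetches * cycles_per_fetch


# Hypothetical shader: 50 ALU instructions and 2 texture fetches.
cheap_alu = shader_cycles(50, 1, 2, 1000)    # every ALU op costs 1 cycle
costly_alu = shader_cycles(50, 4, 2, 1000)   # every ALU op costs 4 cycles
one_fetch = shader_cycles(50, 1, 1, 1000)    # same ALU work, one fetch removed

print("cheap ALU ops: ", cheap_alu)
print("costly ALU ops:", costly_alu)
print("one fetch less:", one_fetch)
```

With these assumed figures, making every ALU instruction four times more expensive changes the total by only a few percent, while removing a single slow fetch roughly halves it.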
So, the key to maximizing GPU performance is not using faster instructions. It's about maximizing occupancy, i.e. providing enough work to keep the GPU busy enough to hide instruction and memory latencies. It's about being smart about memory access, to minimize those excruciating round trips to DRAM. And sometimes, when you're really lucky, it's about using fewer instructions.