HLSL operator / function cycle count

I am modeling some algorithms that will run on GPUs. Is there a reference anywhere for how many cycles the various intrinsic functions and calculations take on modern hardware (NVIDIA 5xx+ series, AMD 6xxx+ series)? I can't find any official word on this, even though the vendors' documentation mentions in places that normalization, square root, and other functions cost more. Thanks.

+4
3 answers

Unfortunately, the cycle-count documentation you are looking for either does not exist or, if it does, will probably not be as useful as you expect. You are right that some of the more complex GPU instructions take longer to execute than simpler ones, but cycle counts matter only when instruction execution time is the main performance bottleneck, and GPUs are designed so that this is very rarely the case.

GPU shader programs achieve such high performance by running many (potentially thousands of) shader threads in parallel. Each shader thread usually executes no more than one instruction before being swapped out for another thread. Under ideal conditions, there are enough threads in flight that some of them are always ready to execute their next instruction, so the GPU never has to stall; this hides the latency of any operation performed by a single thread. If the GPU does useful work in every cycle, it is as if every shader instruction executed in one cycle. In that case, the only way to make your program faster is to make it shorter (fewer instructions = fewer cycles of work overall).
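To make the latency-hiding argument concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not a description of any real GPU scheduler: one instruction can be issued per cycle, every instruction has the same fixed latency, and a thread cannot issue again until its previous instruction has completed. The function name `simulate_utilization` and its parameters are made up for illustration.

```python
def simulate_utilization(num_threads: int, latency: int,
                         instructions_per_thread: int = 100) -> float:
    """Return the fraction of cycles in which the toy GPU issues an instruction.

    Assumptions (hypothetical, for illustration only):
    - at most one instruction is issued per cycle;
    - every instruction takes `latency` cycles;
    - a thread must wait for its previous instruction before issuing the next.
    """
    ready_at = [0] * num_threads                      # earliest cycle each thread may issue
    remaining = [instructions_per_thread] * num_threads
    total = num_threads * instructions_per_thread
    issued = 0
    cycle = 0
    while issued < total:
        # Pick the first thread that still has work and is not waiting.
        for t in range(num_threads):
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                issued += 1
                break                                 # one issue slot per cycle
        cycle += 1
    return issued / cycle


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "threads:", round(simulate_utilization(n, latency=4), 3))
```

In this model, with a latency of 4 cycles, a single thread keeps the issue slot busy only about a quarter of the time, while four or more threads keep it busy every cycle. At that point the per-instruction latency no longer shows up in the total run time, which is the point the answer is making.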

Under more realistic conditions, when there is not enough work to fully load the GPU, the bottleneck is almost guaranteed to be memory access, not the ALU. A single texture fetch can take thousands of cycles to return in the worst case; with stalls that unpredictable, it is generally not worth worrying that sqrt() takes more cycles than dot().
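A back-of-the-envelope calculation shows why. The sketch below assumes a deliberately simple model that is not taken from any vendor documentation: each thread does some ALU work and then one memory fetch, and the fetch latency is hidden only by the ALU work of the other threads. All names and numbers (`exposed_stall`, `time_per_round`, a 400-cycle fetch, 8 threads) are hypothetical.

```python
def exposed_stall(fetch_latency: int, alu_cycles: int, threads: int) -> int:
    """Cycles of one fetch that the other threads' ALU work fails to cover."""
    covered = alu_cycles * (threads - 1)
    return max(0, fetch_latency - covered)


def time_per_round(fetch_latency: int, alu_cycles: int, threads: int) -> int:
    """Cycles for every thread to do its ALU work plus one fetch (toy model)."""
    return alu_cycles * threads + exposed_stall(fetch_latency, alu_cycles, threads)


if __name__ == "__main__":
    baseline = time_per_round(fetch_latency=400, alu_cycles=10, threads=8)
    slower_alu = time_per_round(fetch_latency=400, alu_cycles=13, threads=8)
    faster_mem = time_per_round(fetch_latency=200, alu_cycles=10, threads=8)
    print(baseline, slower_alu, faster_mem)  # prints: 410 413 210
```

In this model, adding 3 ALU cycles per thread (say, a more expensive sqrt()) moves the total from 410 to 413 cycles, because the extra ALU work mostly fills time that would otherwise be spent stalled. Halving the fetch latency moves it from 410 to 210.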

So the key to maximizing GPU performance is not using faster instructions. It's about maximizing occupancy, i.e. providing enough work to keep the GPU busy enough to hide instruction and memory latencies. It's about being smart with memory access, to minimize those excruciating round trips to DRAM. And sometimes, when you're really lucky, it's about using fewer instructions.

+1

http://books.google.ee/books?id=5FAWBK9g-wAC&lpg=PA274&ots=UWQi5qznrv&dq=instruction%20slot%20cost%20hlsl&pg=PA210#v=onepage&q=table%20a-8&f=false

This is the closest thing I have found so far. It is outdated (SM3), but I think it is better than nothing.

0

Do operators and functions even have a cycle count? I know that assembly instructions have cycle counts, which are a low-level timing measure that depends mainly on the processor. Operators and functions are high-level programming constructs, so I don't think they have such a measure.

-1

Source: https://habr.com/ru/post/1438007/

