CUDA: the overhead of warp divergence vs. additional arithmetic

Of course, warp divergence via if and switch statements should be avoided at all costs on GPUs.

But what is the overhead of warp divergence (scheduling only certain threads to execute certain lines) compared to extra, useless arithmetic?

Consider the following dummy example:

Version 1:

    __device__ int get_D(int A, int B, int C) {
        // The value A is potentially different for every thread.
        int D = 0;
        if      (A < 10) D = A*6;
        else if (A < 17) D = A*6 + B*2;
        else if (A < 26) D = A*6 + B*2 + C;
        else             D = A*6 + B*2 + C*3;
        return D;
    }

vs.

Version 2:

    __device__ int get_D(int A, int B, int C) {
        // The value A is potentially different for every thread.
        return A*6 + (A >= 10)*(B*2) + (A >= 17)*C + (A >= 26)*(C*2);
    }

My real scenario is more complex (more conditions), but the same idea.

Questions:

Is the (scheduling) overhead of warp divergence so large that version 1 is slower than version 2?

Version 2 requires many more ALU operations than version 1, and most of them are wasted on "multiplying by 0" (only a few of the selected conditional expressions evaluate to 1 rather than 0). Does this tie up valuable ALUs in useless operations, delaying instructions in other warps?

1 answer

Specific answers to questions like this are usually difficult to provide. Many factors influence the comparison between the two cases:

  • You say that A is potentially different for each thread, but the extent to which this is true will really affect the comparison.
  • In general, whether your code is compute-bound or memory-bandwidth-bound will certainly affect the answer. (If your code is bandwidth-bound, there may be no perceptible performance difference between the two cases.)
  • I know you have declared A, B, C as int, but seemingly innocuous changes, such as making them float, could significantly affect the answer.

Fortunately, there are profiling tools that can help give crisp, specific answers (or perhaps indicate that there is not much difference between the two cases). You have done a pretty good job of laying out the 2 specific cases you care about. Why not benchmark the 2? And if you want to dig deeper, the profiling tools can give statistics on instruction replays (due to warp divergence), bandwidth vs. compute utilization, and so on.
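
For example, a minimal benchmarking sketch might look like the code below. The renamed device functions are copies of the two versions above; the wrapper kernel names, the fixed values B = 3 and C = 7, and the i % 32 input pattern are my own assumptions for illustration, so substitute your real data distribution, since the outcome is data-dependent.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Copies of the two variants, renamed so both can live in one file.
    __device__ int get_D_v1(int A, int B, int C) {
        int D = 0;
        if      (A < 10) D = A*6;
        else if (A < 17) D = A*6 + B*2;
        else if (A < 26) D = A*6 + B*2 + C;
        else             D = A*6 + B*2 + C*3;
        return D;
    }

    __device__ int get_D_v2(int A, int B, int C) {
        return A*6 + (A >= 10)*(B*2) + (A >= 17)*C + (A >= 26)*(C*2);
    }

    // One kernel per variant, so each shows up as a separate entry in the profiler.
    __global__ void k_v1(const int *a, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = get_D_v1(a[i], 3, 7);
    }

    __global__ void k_v2(const int *a, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = get_D_v2(a[i], 3, 7);
    }

    int main() {
        const int n = 1 << 24;
        int *a, *out;
        cudaMallocManaged(&a, n * sizeof(int));
        cudaMallocManaged(&out, n * sizeof(int));
        // Values 0..31 exercise all four branches; replace with your real distribution.
        for (int i = 0; i < n; ++i) a[i] = i % 32;

        dim3 block(256), grid((n + block.x - 1) / block.x);

        // Warm-up launches so one-time startup costs are not timed.
        k_v1<<<grid, block>>>(a, out, n);
        k_v2<<<grid, block>>>(a, out, n);
        cudaDeviceSynchronize();

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        float ms = 0.0f;

        cudaEventRecord(start);
        k_v1<<<grid, block>>>(a, out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("version 1: %.3f ms\n", ms);

        cudaEventRecord(start);
        k_v2<<<grid, block>>>(a, out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("version 2: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(a);
        cudaFree(out);
        return 0;
    }

A kernel this small does very little arithmetic per byte of memory traffic, so it is likely to be memory-bandwidth-bound, which is exactly the caveat above; a profiler run on the same two kernels will tell you whether divergence or bandwidth is the limiter.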

I do have to take exception to this statement:

Of course, warp divergence via if and switch statements should be avoided at all costs on GPUs.

This is simply not true. The machine's ability to handle divergent control flow is actually a feature: it is what allows us to program it in friendlier languages such as C/C++, and it actually distinguishes it from some other acceleration technologies that do not give the programmer this flexibility.

Like any other optimization effort, you should focus it on the heavy lifting first. Does this code represent the bulk of your work? In most cases it makes no sense to put this level of analysis effort into something that is essentially boilerplate code or is not part of the main work your application does.

And if this is where your code spends the bulk of its effort, then the profiling tools are a really effective way to get good, meaningful answers, which are likely to be more useful than trying to do an academic analysis.

Now for my take on your questions:

Is the (scheduling) overhead of warp divergence so large that version 1 is slower than version 2?

This will depend on the actual level of divergence that occurs. In the worst case, with completely independent paths for all 32 threads, the machine serializes completely and you are effectively running at 1/32 of peak performance. A binary-decision-tree style of subdivision cannot yield this worst case, but it can certainly approach it by the end of the tree. It might be possible to see more than a 50% slowdown in this code, perhaps 80% or higher, due to complete divergence of the threads at the end. But it will depend statistically on how often the divergence actually occurs (i.e. it is data-dependent). In the worst case, I would expect version 2 to be faster.
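
To make the 1/32 worst case concrete, here is a hypothetical kernel (not from the question) in which every lane of a warp takes its own path, so the per-lane bodies execute one after another instead of in parallel:

    // Worst-case illustration: all 32 lanes of a warp diverge.
    // Each loop iteration has exactly one active lane while the other 31 wait,
    // so the per-lane work runs roughly 32 times in sequence.
    __global__ void fully_divergent(int *out, int iters) {
        int lane = threadIdx.x & 31;          // lane index within the warp
        int v = lane;
        for (int p = 0; p < 32; ++p) {
            if (lane == p) {
                for (int k = 0; k < iters; ++k)
                    v = v * 3 + p;            // loop-carried dependence, hard to optimize away
            }
        }
        out[blockIdx.x * blockDim.x + threadIdx.x] = v;
    }

A non-divergent variant, in which every lane runs the same inner loop exactly once, would do the same per-lane work in roughly 1/32 of the time.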

Version 2 requires many more ALU operations than version 1, and most of them are wasted on "multiplying by 0" (only a few of the selected conditional expressions evaluate to 1 rather than 0). Does this tie up valuable ALUs in useless operations, delaying instructions in other warps?

float vs. int may really matter here, and perhaps you could consider exploring it. But the second case appears (to me) to have all of the same comparisons as the first case, plus a few additional multiplies. In the float case, the machine can retire one multiply per thread per clock, so that would be quite fast. In the int case it is slower, and you can see the specific throughputs, which depend on the architecture, in the arithmetic instruction throughput table of the CUDA Programming Guide. I would not be overly concerned about that level of arithmetic. And, again, none of it may matter at all if your application is memory-bandwidth-bound.
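
If you do explore float, a sketch of what version 2 could look like is below; whether B and C can sensibly become float is my assumption, not something stated in the question.

    // Hypothetical float variant of version 2. The comparisons still yield an
    // int 0/1, which is converted to float before the multiply.
    __device__ float get_D_f(int A, float B, float C) {
        return A*6.0f + (A >= 10)*(B*2.0f) + (A >= 17)*C + (A >= 26)*(C*2.0f);
    }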

Another way to tease all of this apart would be to write kernels that exercise the codes of interest, compile them to PTX (nvcc -ptx ...), and compare the PTX instructions. This gives a much better idea of what the machine thread code will look like in each case, and if you do something like just count the instructions, you may find little difference between the two cases (which, in that case, should favor option 2).
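
As a concrete sketch, assuming the two wrapper kernels above are saved in a hypothetical file compare_get_d.cu, and with the -arch flag adjusted to your GPU:

    nvcc -ptx -arch=sm_70 compare_get_d.cu -o compare_get_d.ptx

You can then diff the two .entry bodies in the generated .ptx file, or simply count the instructions in each. Keep in mind that PTX is an intermediate representation; for the final machine code, compile normally and inspect the SASS with cuobjdump -sass on the resulting binary.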

