CUDA / OpenCL: rewriting branches as non-branching expressions

There are many cases where a branch seems required in a CUDA or OpenCL program, for example:

for (int i = 0; i < width; i++) {
    if (i % threadIdx.x == 0)
        quantity += i * i;
}

The code can always (or at least most of the time) be rewritten in a branch-free style:

for (int i = 0; i < width; i++) {
    quantity += i * i * (i % threadIdx.x == 0);
}

The compromise, it seems: either the branch occupies extra warp slots, or all threads perform more calculations (in the second case the addition always executes, it just sometimes adds zero).

Assuming branching operations take up a warp slot for each possible branch, one would expect the second version to be consistently better than the first. Now my question is: can I rely on the compiler to optimize (1) into (2) whenever it makes sense, or is there no broadly applicable criterion, meaning that which is better cannot be decided in general without trying and profiling?
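
To make "try it and profile" concrete, here is a minimal benchmark sketch (my addition, not from the original post; the kernel names and sizes are invented, and a threadIdx.x > 0 guard is added because the original snippet computes i modulo thread index 0):

#include <cstdio>
#include <cuda_runtime.h>

// Branching version from the question.
__global__ void branched(int width, int *out) {
    int quantity = 0;
    if (threadIdx.x > 0) {
        for (int i = 0; i < width; i++) {
            if (i % threadIdx.x == 0)       // divergent branch within the warp
                quantity += i * i;
        }
    }
    out[threadIdx.x] = quantity;
}

// Branch-free version: the addition always executes, sometimes adding zero.
__global__ void branchFree(int width, int *out) {
    int quantity = 0;
    if (threadIdx.x > 0) {
        for (int i = 0; i < width; i++)
            quantity += i * i * (i % threadIdx.x == 0);
    }
    out[threadIdx.x] = quantity;
}

int main() {
    const int threads = 256;
    const int width = 1 << 16;
    int *d_out;
    cudaMalloc(&d_out, threads * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time each kernel once (warm-up and error checking omitted for brevity).
    cudaEventRecord(start);
    branched<<<1, threads>>>(width, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msBranched = 0.0f;
    cudaEventElapsedTime(&msBranched, start, stop);

    cudaEventRecord(start);
    branchFree<<<1, threads>>>(width, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msBranchFree = 0.0f;
    cudaEventElapsedTime(&msBranchFree, start, stop);

    printf("branched: %.3f ms, branch-free: %.3f ms\n", msBranched, msBranchFree);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}

Note that the compiler often predicates short branch bodies like this one on its own, which is exactly why measuring both variants is the only reliable answer.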

+6
3 answers

Modulo operations are quite expensive: I'm fairly sure that adding the modulo computation will take more time than just having one instruction executed by only 1 thread. Your lone branching if, an if without an else, will only stall the other threads while it executes. Since GPUs are optimized for very fast context switching, there should be very little cost involved.

Be wary of long branching statements, though: too much serialized computation on the GPU (i.e., one thread doing all the work) negates the advantage of parallelism.
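
To illustrate that warning, a hypothetical kernel (invented here, not from the answer) where a long branch serializes all the work onto a single thread:

__global__ void serialSum(const int *in, int *out, int n) {
    // Anti-pattern: a long branch taken by one thread only. Every other
    // thread in the warp sits idle for the entire loop, so the device
    // effectively runs single-threaded.
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += in[i];
        *out = sum;
    }
}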

+3

In my experience, it is entirely up to the compiler writers whether these kinds of edge cases get optimized.

Can I think of any cases where (1) cannot be turned into (2)? Here is one: I have written kernels where it was more efficient to run certain parts of the computation only once every ten threads or so, and there such an optimization cannot be deduced, even if a mathematical operation exists (a divide with a subtraction, say) that could yield the same result regardless of the conditional versus 'run on all threads but produce no result'.
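
A hypothetical sketch of that situation (the kernel name and workload are invented): the guarded computation is heavy enough that running it on all threads and discarding the results would plainly waste work, so the compiler should not, and realistically cannot, turn the branch into predicated arithmetic:

__global__ void everyTenthThread(const float *in, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Only every tenth thread performs the expensive part of the work;
    // out is assumed to hold n / 10 elements.
    if (tid < n && tid % 10 == 0) {
        float acc = 0.0f;
        for (int k = 0; k < 1000; k++)   // deliberately heavy inner loop
            acc += sinf(in[tid] + (float)k);
        out[tid / 10] = acc;
    }
}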

However, even though a threadId == 0 check is a fairly common scenario, I have no idea whether it actually gets optimized. I would say it depends on the implementation, and even on the device itself (CPU vs GPU).

You will have to experiment to really find out what works best, not only for the reason above, but also because the work scheduler may behave differently depending on how expensive it is to schedule/start/stop a set of threads as opposed to having all of them run (with most of them producing a zero result).

Hope this helps!

+1

I don't remember much CUDA, but why don't you parallelize your loop? You would need to use atomic operations [1] to accumulate the additions. I hope this helps! Sorry if this misses the point.
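
A minimal sketch of what this answer seems to suggest (my interpretation; the kernel name and launch shape are invented): give each thread one iteration of the loop and combine the partial results with atomicAdd:

__global__ void sumSquares(int width, int *quantity) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < width)
        atomicAdd(quantity, i * i);   // safe concurrent accumulation
}

// Usage: zero *quantity on the device first, then launch e.g.
//   sumSquares<<<(width + 255) / 256, 256>>>(width, d_quantity);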

0

Source: https://habr.com/ru/post/915839/
