CUDA or OpenCL kernels often contain a branch, for example:
for (int i=0; i<width; i++) { if( i % threadIdx.x == 0) quantity += i*i; }
Such code can usually (or at least most of the time) be rewritten in a branchless style:
for (int i=0; i<width; i++) { quantity += i*i * (i % threadIdx.x == 0); }
The trade-off seems to be: either only part of the warp does useful work under the branch, or every thread performs the multiply-add (in the second version the addition always executes, but the added value is sometimes zero).
Assuming a branch costs the warp extra issue slots for each taken path, one would expect the second version to be consistently at least as fast as the first. My question: can I rely on the compiler to optimize 1) into 2) whenever it makes sense, or is there no widely applicable criterion, so that it cannot be decided in general which is better without profiling?