Why can't the C++ compiler optimize "if (test) --foo" to "foo -= test"?

I have a function that finds the exponent of the next power of two for a given integer. If the integer is itself a power of two, it returns that exact exponent.

Pretty straightforward:

    char nextpow2if(int a) {
        char foo = char(32 - __builtin_clz(a));
        bool ispow2 = !(a & a-1);
        if (ispow2)
            --foo;
        return foo;
    }

However, after compiling with gcc 6 at -O2 and checking the generated assembly, I see that foo - 1 is computed and then selected with a seemingly useless cmovne instruction. Even worse, with gcc 5 and older I get an actual jne branch in the code.

It would compile to faster code if I had written the following function instead:

    char nextpow2sub(int a) {
        char foo = char(32 - __builtin_clz(a));
        bool ispow2 = !(a & a-1);
        return foo - ispow2;
    }

All compilers correctly compile this code to the shortest (and fastest) possible assembly, using sete and a subtraction of the bool.

Why can't the compiler optimize the first version? This seems like a very simple case of if-conversion. Why do gcc 5 and older compile it into an actual jne branch? Is there a corner case between the two versions that I don't see, one that could make them behave differently?
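As a side note, the behavioral equivalence of the two variants is easy to check empirically. A minimal harness (assuming a gcc/clang-style compiler, since __builtin_clz is a GNU builtin, and the helper function name is illustrative) confirms that they agree over a range of positive inputs:

```cpp
// The two variants from the question, reproduced side by side.
char nextpow2if(int a) {
    char foo = char(32 - __builtin_clz(a));
    bool ispow2 = !(a & (a - 1));
    if (ispow2) --foo;
    return foo;
}

char nextpow2sub(int a) {
    char foo = char(32 - __builtin_clz(a));
    bool ispow2 = !(a & (a - 1));
    return foo - ispow2;
}

// Returns true if both variants agree for every a in [1, n].
bool variants_agree(int n) {
    for (int a = 1; a <= n; ++a)
        if (nextpow2if(a) != nextpow2sub(a))
            return false;
    return true;
}
```

Note that __builtin_clz is undefined for a zero argument, so only positive inputs are meaningful here.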

PS: demo here

Edit: I did not test performance with gcc 6, but with gcc 5 the latter version is about twice as fast (at least in a synthetic benchmark). That is what really prompted me to ask this question.

2 answers

I believe the reason may be that a bool is typically stored as a byte. The compiler therefore cannot safely assume that the value actually in memory is exactly 1. A true/false check probably just compares against zero, whereas a subtraction would propagate whatever value is really stored, which could have visible side effects.

See this example code on Ideone:

    #include <iostream>
    using namespace std;

    union charBool {
        unsigned char aChar;
        bool aBool;
    };

    int main() {
        charBool var;
        charBool* varMemory = &var;

        var.aBool = 65;
        std::cout << "a boolean = " << var.aBool << std::endl;
        std::cout << "a char = " << var.aChar << std::endl;
        std::cout << "varMemory = " << (*(reinterpret_cast<unsigned char*>(varMemory))) << std::endl;

        var.aChar = 98;  // note: Ideone's C++ compiler resolves this to zero, so bit 0 seems to be the only bit checked
        std::cout << "a boolean = " << var.aBool << std::endl;
        std::cout << "a char = " << var.aChar << std::endl;
        std::cout << "varMemory = " << (*(reinterpret_cast<unsigned char*>(varMemory))) << std::endl;

        return 0;
    }

This leads to:

    a boolean = 1
    a char = 
    varMemory = 
    a boolean = 0
    a char = b
    varMemory = b

(note: the first two characters are not printable)


Well, the compiler could indeed perform this optimization in this particular case without violating the standard. But consider the following slightly different case:

    char nextpow2sub(int a) {
        char foo = char(32 - __builtin_clz(a));
        bool ispow2 = !(a & a-1);
        return foo - (5 * ispow2);
    }

    char nextpow2if(int a) {
        char foo = char(32 - __builtin_clz(a));
        bool ispow2 = !(a & a-1);
        if (ispow2)
            foo = foo - 5;
        return foo;
    }

The only change I made here is that I subtract 5 instead of 1. If you compile with gcc 6.x and compare, you will see that the generated binaries are the same size for both functions. I expect both to have more or less the same performance.

This shows that the optimization algorithm used by the compiler was designed to handle the general case. Even so, for the subtract-by-1 case I expect (with gcc 6.x) only a slight performance difference on any modern processor that supports instruction-level parallelism and register renaming.
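The subtract-by-1 case is special because C++ guarantees that a bool promotes to exactly 0 or 1 in integer arithmetic, which is what makes the branchless rewrite valid in the first place. A minimal sketch of the identity (the function names here are illustrative, not from the original post):

```cpp
// `if (b) --x;` and `x -= b;` are equivalent for any int x,
// because a bool promotes to int with value exactly 0 or 1.
int sub_via_branch(int x, bool b) {
    if (b) --x;
    return x;
}

int sub_via_bool(int x, bool b) {
    return x - b;  // b promotes to 0 or 1
}
```

For a subtrahend other than 1, the branchless form needs an extra multiply (as in 5 * ispow2), which is why the two shapes end up compiling to similar-sized code in that case.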

You wrote: "All compilers correctly compile this code to the shortest (and fastest) possible assembly, using sete and a subtraction of the bool."

How do you know that this is the shortest and fastest code? Yes, it is shorter and faster, but do you have proof that it is the shortest and fastest possible? Moreover, you cannot make such a claim without specifying a particular architecture and microarchitecture.


Source: https://habr.com/ru/post/1012368/