Is the `if` statement redundant before a modulo operation and before an assignment?

Consider the following code:

    unsigned idx;

    // ... some work with idx

    if( idx >= idx_max )
        idx %= idx_max;

This can be simplified to just the second line:

 idx %= idx_max; 

and achieve the same result.




Several times I have come across the following code:

    unsigned x;

    // ... some work with x

    if( x != 0 )
        x = 0;

It can be simplified to

 x=0; 



Questions:

  • Does it make sense to use the `if`, and why? Especially with the ARM Thumb instruction set.
  • Can these `if`s be omitted?
  • What optimization does the compiler perform?
+46
c++ performance c arm thumb
May 2 '17 at 6:10
4 answers

If you want to understand what the compiler is doing, just pull up the assembly. I recommend this site (I have already entered the code from the question): https://godbolt.org/g/FwZZOb .

The first example is more interesting.

    int div(unsigned int num, unsigned int num2) {
        if( num >= num2 )
            return num % num2;
        return num;
    }

    int div2(unsigned int num, unsigned int num2) {
        return num % num2;
    }

Compiles to:

    div(unsigned int, unsigned int):   # @div(unsigned int, unsigned int)
            mov     eax, edi
            cmp     eax, esi
            jb      .LBB0_2
            xor     edx, edx
            div     esi
            mov     eax, edx
    .LBB0_2:
            ret
    div2(unsigned int, unsigned int):  # @div2(unsigned int, unsigned int)
            xor     edx, edx
            mov     eax, edi
            div     esi
            mov     eax, edx
            ret

Basically, the compiler will not optimize away the branch, for very specific and logical reasons. If integer division were about as cheap as a comparison, the branch would indeed be pointless. But integer division (which modulo is typically implemented with) is actually very expensive: http://www.agner.org/optimize/instruction_tables.pdf . The numbers vary greatly by architecture and operand size, but the latency is typically anywhere from 15 to 100 cycles.

By taking the branch before executing the modulo, you can save a lot of work. Note, though, that the compiler also does not convert the branchless code into a branch at the assembly level. That is because the branch has a downside too: if the modulo turns out to be necessary anyway, you have just wasted a little extra time.
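The trade-off can be sketched like this (the function names `clamp_branch` and `clamp_plain` are mine, not from the question); both forms return the same value for any `idx_max > 0`, and the only difference is how often the expensive division actually executes:

```cpp
#include <cassert>

// Guarded form: pays for the div instruction only when idx
// is actually out of range.
unsigned clamp_branch(unsigned idx, unsigned idx_max) {
    if (idx >= idx_max)
        idx %= idx_max;
    return idx;
}

// Unguarded form: always pays for the division.
unsigned clamp_plain(unsigned idx, unsigned idx_max) {
    return idx % idx_max;
}
```

Which one is faster depends entirely on how often `idx >= idx_max` holds at runtime, which is exactly the information the compiler does not have.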

There is no way to make a reasonable determination of the correct optimization without knowing the relative frequency with which idx < idx_max will be true. Therefore the compilers (gcc and clang do the same thing) opt to translate the code relatively transparently, leaving the choice in the hands of the developer.

So the first branch may well be a very sensible choice.

The second branch ought to be completely meaningless, because a comparison and an assignment cost about the same. However, you can see at the link that compilers will still not perform this optimization when the variable is accessed through a reference. If the value is a local variable (as in your demonstrated code), then the compiler will optimize the branch away.
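A minimal sketch of that reference case (the function names are mine): with a reference parameter the compiler generally cannot prove the conditional write is equivalent to the unconditional one, so it keeps the branch; the observable result is identical either way.

```cpp
#include <cassert>

// With a reference parameter, compilers typically keep this branch
// in the generated code: the write may be observable elsewhere.
void reset_if_nonzero(unsigned &x) {
    if (x != 0)
        x = 0;
}

// Unconditional write: same observable result, no branch.
void reset(unsigned &x) {
    x = 0;
}
```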

In sum: the first piece of code is probably a reasonable optimization; the second is probably just a tired programmer.

+66
May 2 '17 at 7:26

There are a number of situations where writing a variable with a value it already holds can be slower than reading it, finding it already holds the desired value, and skipping the write. Some systems have a write-through processor cache that immediately sends all write requests to memory. While such designs are not commonplace today, they used to be fairly common, as they can provide a substantial part of the performance improvement that full read/write caching offers, at a fraction of the cost.

Code like the above can also make sense in some multi-CPU situations. The most common such situation is when code running simultaneously on two or more CPU cores repeatedly hits the same variable. In a multi-core caching system with a strong memory model, a core that wants to write the variable must first negotiate with the other cores to gain exclusive ownership of the cache line containing it, and must then negotiate again to relinquish that ownership the next time any other core wants to read or write it. Such operations can be very expensive, and the cost must be paid even when each write merely stores the value the location already holds. If, however, the location becomes zero and is never written again, all cores can hold the cache line simultaneously in shared, read-only state and never have to negotiate over it again.

In nearly all situations where multiple processors can hit a variable, the variable should be declared at least volatile . The one exception that might apply here is when every write to the variable after the start of main() stores the same value, and the code behaves correctly regardless of whether a given store is visible from one processor to another. If some operation would be wasteful to perform several times but otherwise harmless, and the purpose of the variable is to say whether it still needs to be performed, then many implementations can generate better code without the volatile qualifier than with it, provided they do not try to "improve" efficiency by making the write unconditional.
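The check-before-write pattern described above can be sketched as follows; I use std::atomic here as the modern C++ analogue of the volatile discussion, so this is an illustration of the idea, not the answer's exact setup:

```cpp
#include <atomic>
#include <cassert>

std::atomic<unsigned> shared_flag{1};

// Read first; only take exclusive ownership of the cache line
// (via a store) when the value actually needs to change. Once the
// flag is zero, repeated calls are read-only and the cache line
// can stay in shared state across cores.
void clear_shared_flag() {
    if (shared_flag.load(std::memory_order_relaxed) != 0)
        shared_flag.store(0, std::memory_order_relaxed);
}
```

The second and later calls perform only a load, which is exactly the property that avoids the ownership negotiation the answer describes.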

Incidentally, if the variable were accessed through a pointer, there would be another possible reason for code like the above: if the function is designed to accept either a const object in which a certain field is already zero, or a non- const object that should have that field set to zero, code like the above may be required to provide correct behavior in both cases.

+7
May 2 '17 at 19:38

Regarding the first block of code: this is a micro-optimization based on Chandler Carruth's recommendations for Clang (see here for more information); however, it is not guaranteed to be an actual micro-optimization in this form (using if rather than a ternary), or on every compiler.

Modulo is a rather expensive operation; if the code is executed frequently and there is a strong statistical bias to one side or the other of the condition, the CPU's branch prediction (on a modern processor) will significantly reduce the cost of the branch instruction.

+2
May 2 '17 at 10:49

Using the if seems like a bad idea to me.

You're right that it is redundant. Whether or not idx >= idx_max , idx will be below idx_max after idx %= idx_max . And if idx < idx_max , it is unchanged whether the if is there or not.

While you might think that branching around the modulo could save time, the real culprit, I would say, is that when branches are mispredicted, a modern processor's pipeline must be flushed, and that takes a relatively long time. Better to skip the branch and just pay for the integer modulo, which costs about the same time as an integer division.

EDIT: It turns out that the modulo is rather slow relative to the branch, as others here suggest. Here is someone examining this exact question: CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!" (suggested in another SO question linked from another answer to this question).

This guy writes compilers, and he thought it would be faster without the branch; but his benchmarks proved him wrong. Even when the branch was taken 20% of the time, having it tested faster.

Another reason not to have the if : fewer lines of code to maintain, and fewer for someone else to puzzle over. The guy in the above link actually created a "faster modulo" macro. IMHO, that macro or an inline function is the way to go for performance-critical applications, because your code will be just as understandable without the branch, yet run just as fast.
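Such a helper could look like this as an inline function; this is my sketch of the idea, not the talk's exact macro:

```cpp
#include <cassert>

// Executes the expensive division only when the value is actually
// out of range; otherwise returns the input untouched.
inline unsigned fast_mod(unsigned x, unsigned m) {
    return x >= m ? x % m : x;
}
```

The branch is hidden behind a name that says what it does, so call sites read as plainly as a bare `%`.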

Finally, the guy in the aforementioned video plans to bring this optimization to the attention of compiler authors. If that happens, the if will probably be added for you even when it is not in the code, and the plain mod alone will do.

+1
May 2 '17 at 6:19


