The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:
y.store(1, order); y.store(2, order); y.store(3, order); // inlining + constant-folding could produce this in real code
The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a data-race bug, but only the garden-variety kind of bug, not the C++ Undefined Behaviour kind of data race. (It's UB only with non-atomic variables.) A program that expects to sometimes see it is not necessarily even buggy. (See below re: progress bars.)
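For concreteness, here is a minimal sketch of such an observer (the relaxed ordering and the function names are just illustrative assumptions, not from the question):

#include <atomic>

std::atomic<int> y{0};

// Writer thread: the three stores may legally be folded into a single y.store(3).
void writer() {
    y.store(1, std::memory_order_relaxed);
    y.store(2, std::memory_order_relaxed);
    y.store(3, std::memory_order_relaxed);
}

// Observer thread: spins until y changes. It will eventually see the final
// value 3, but it is NOT guaranteed to ever observe the intermediate value 2.
int observer() {
    int seen;
    do {
        seen = y.load(std::memory_order_relaxed);
    } while (seen == 0);
    return seen;   // could be 1, 2, or 3; a conforming build may only ever return 3
}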
Any ordering that is possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it is as if all three stores happened back-to-back in the global ordering, with no loads or stores from other threads happening between y=1 and y=3.
It does not depend on the target architecture or hardware; compile-time reordering of relaxed atomic operations is allowed even when targeting strongly ordered x86. The compiler does not have to preserve anything you might expect from thinking about the hardware you are compiling for, so you need barriers. The barriers may compile into zero asm instructions.
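As a concrete illustration (the exact codegen is of course compiler- and target-dependent, so treat this as a sketch): a release fence constrains compile-time reordering, yet on x86 it typically emits no instructions at all, because the hardware is already strongly ordered:

#include <atomic>

std::atomic<int>  data{0};
std::atomic<bool> ready{false};

void publish() {
    data.store(42, std::memory_order_relaxed);
    // Stops the compiler from reordering the two stores across it;
    // on x86 this fence usually compiles to zero machine instructions.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}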
So why don't compilers do this optimization?
It is a quality-of-implementation issue, and it can change the observed performance / behaviour on real hardware.
The most obvious case where this is a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in the progress bar staying at 0 and then jumping to 100% right at the end.
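A hypothetical sketch of the pattern (the progress variable, the percentage math, and compute_chunk are placeholders for illustration): if a compiler were allowed to sink and coalesce the stores, a UI thread polling progress would see 0 and then 100 with nothing in between:

#include <atomic>

std::atomic<int> progress{0};   // polled by a UI thread

void compute_chunk(int);        // placeholder for the real per-iteration work

void do_work(int n) {
    for (int i = 0; i < n; ++i) {
        compute_chunk(i);                                   // no other atomics inside
        progress.store(i * 100 / n, std::memory_order_relaxed);
    }
    progress.store(100, std::memory_order_relaxed);
    // A legal (but unwanted) transformation would be to delete the store inside
    // the loop and keep only the final progress.store(100).
}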
There is no C++11 std::atomic way to stop compilers from doing this in the cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation does not change their order relative to each other.)
Compiler writers have correctly noticed that programmers expect an atomic store to actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which argue that the stores are required to happen separately because possible readers expect to see an intermediate value.) i.e. it violates the principle of least surprise.
However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref-count inc/dec in a loop, as sketched below.
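For example (a sketch; process, hot_loop, and the container are made-up names): passing a shared_ptr by value inside a hot loop does an atomic ref-count increment and decrement on every iteration, with zero net effect, which a smarter compiler could in principle cancel out:

#include <memory>
#include <vector>

void process(std::shared_ptr<int> p, int x);   // placeholder; takes shared_ptr by value

void hot_loop(const std::shared_ptr<int>& sp, const std::vector<int>& v) {
    for (int x : v) {
        // Each call copies sp: an atomic ref-count increment on entry and a
        // decrement when the copy is destroyed, even though the count ends
        // each iteration exactly where it started.
        process(sp, x);
    }
}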
Obviously, any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be a full barrier against runtime and compile-time reordering, even if it no longer touched the memory at num.
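A minimal sketch of that constraint (assuming the default seq_cst read-modify-write operations, with made-up variable names): even if the net change to num is zero, the pair still has to order the surrounding non-atomic accesses:

#include <atomic>

std::atomic<int> num{0};
int a = 0, b = 0;   // plain non-atomic data

void f() {
    a = 1;
    num++;          // seq_cst RMW: acts as a full barrier on typical implementations
    num--;          // net effect on num is zero, but...
    b = 2;          // ...neither a=1 nor b=2 may be moved across the RMW pair
}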
Extending the std::atomic API to give programmers control over such optimizations is under discussion; once that exists, compilers will be able to optimize when it is useful, which can happen even in carefully written code that is not intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:
See also the discussion of this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Keeping it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers simply don't optimize on purpose.)
Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different.) See also std::memory_order's relationship with volatile on cppreference.
Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped I/O registers, for example).
Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so that compilers can start doing it in practice.
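Adapting the earlier progress-bar sketch (same placeholder names) to the current-standard workaround: making the atomic also volatile means each store written in the source has to actually be performed:

#include <atomic>

volatile std::atomic<int> progress{0};   // volatile: each store is observable behaviour

void compute_chunk(int);                 // placeholder for the real per-iteration work

void do_work(int n) {
    for (int i = 0; i < n; ++i) {
        compute_chunk(i);
        // The compiler may not coalesce or sink these stores, because accesses
        // to volatile objects cannot be optimized away.
        progress.store(i * 100 / n, std::memory_order_relaxed);
    }
    progress.store(100, std::memory_order_relaxed);
}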
I think we can be confident that compilers won't start doing this optimization until there is a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as future C++. But it could be like the proposal in wg21/p0062: tag the don't-optimize cases with [[brittle_atomic]].
wg21/p0062 itself warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:
if(x) { foo(); y.store(0); } else { bar(); y.store(0); }
Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it still does exactly one store of the same value. (Which would be after the long loop, if that's what the else branch contains.) Especially if the store is only relaxed or release instead of seq_cst.
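A sketch of the transformation p0062 is worried about, written out by hand as source (compilers don't currently do this; the declarations are assumptions to make the snippet self-contained):

#include <atomic>

extern volatile std::atomic<int> y;
extern bool x;
void foo();
void bar();   // possibly a long-running loop

// Hypothetical transformed version: the identical y.store(0) calls have been
// sunk out of both branches, so in the else case the store is now delayed
// until after bar() finishes.
void transformed() {
    if (x) {
        foo();
    } else {
        bar();
    }
    y.store(0);
}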
volatile does stop the coalescing discussed in this question, but this points out that other optimizations on atomic<> can also be problematic for real performance.
Other reasons for not optimizing include: nobody has written the complicated compiler code that would allow these optimizations to be done safely (without ever getting them wrong). That is not sufficient, because N4455 says LLVM already implements, or could easily implement, several of the optimizations it mentions.
The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.
Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there is no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).
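For reference, a sketch of the gcc/libstdc++ trick that answer describes (this relies on libstdc++ internals, std::__shared_ptr and the __gnu_cxx::_S_single lock policy, so it is non-portable and only safe when the pointer and all its copies stay within a single thread):

#include <memory>

// Non-portable: uses libstdc++'s internal __shared_ptr template with the
// single-threaded lock policy, so ref-count updates are plain (non-atomic) ops.
template <class T>
using shared_ptr_unsynchronized = std::__shared_ptr<T, __gnu_cxx::_S_single>;

// Usage sketch:
// shared_ptr_unsynchronized<int> p(new int(42));   // no atomic RMWs on copy/destroy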