Why don't compilers merge redundant std::atomic writes?

I'm wondering why no compilers are prepared to merge consecutive writes of the same value to a single atomic variable, for example:

    #include <atomic>

    std::atomic<int> y(0);

    void f() {
        auto order = std::memory_order_relaxed;
        y.store(1, order);
        y.store(1, order);
        y.store(1, order);
    }

Every compiler I've tried issues the above write three times. What legitimate, race-free observer could tell the difference between the above code and an optimized version with a single write (i.e. doesn't the as-if rule apply)?

If the variable had been volatile, then obviously no optimization is applicable. What prevents it in my case?

Here is the code on the compiler explorer.

+47
c++ compiler-optimization multithreading c++11 stdatomic
Aug 30 '17 at 12:25
9 answers

The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:

    y.store(1, order);
    y.store(2, order);
    y.store(3, order); // inlining + constant-folding could produce this in real code

The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a data-race bug, but only the garden-variety kind of race, not the C++ Undefined Behaviour kind of data race. (It's UB only with non-atomic variables.) A program that expects to sometimes see it is not necessarily even buggy. (See below re: progress bars.)
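For concreteness, a minimal sketch (the function name is my own, hypothetical) of such an observer; nothing in the standard guarantees its loop ever exits:

    #include <atomic>

    std::atomic<int> y(0);

    void observer() {
        // Spins until it sees y == 2.  An execution in which this loop
        // never terminates is allowed, so coalescing the writer's three
        // stores into y.store(3) is a legal transformation.
        while (y.load(std::memory_order_relaxed) != 2) {
            // keep polling
        }
    }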

Any ordering that is possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between y=1 and y=3.

It doesn't depend on the target architecture or hardware; compile-time reordering of relaxed atomic operations is allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, which is why you need barriers. The barriers may compile to zero asm instructions.
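As a concrete illustration (a sketch, not taken from the question): on x86 an acquire fence typically costs no instructions at all, because ordinary x86 loads already have acquire semantics.

    #include <atomic>

    void acquire_fence() {
        // Usually compiles to zero instructions on x86 (it only restrains
        // the compiler); on weakly-ordered ISAs such as ARM it becomes a
        // real barrier instruction.
        std::atomic_thread_fence(std::memory_order_acquire);
    }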




So why don't compilers do this optimization?

It's a quality-of-implementation issue, and the optimization could change observed performance/behaviour on real hardware.

The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in the progress bar staying at 0 and then jumping to 100% right at the end, as sketched below.
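A sketch of that pattern, with do_chunk_of_work as a hypothetical placeholder containing no atomics or barriers:

    #include <atomic>

    std::atomic<int> progress(0);

    void do_chunk_of_work(int i);  // hypothetical: no atomics, no barriers

    void worker() {
        for (int i = 1; i <= 100; ++i) {
            do_chunk_of_work(i);
            progress.store(i, std::memory_order_relaxed);  // polled by a UI thread
        }
    }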

There is no C++11 std::atomic way to stop them from doing this in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to one another.)

Compiler writers have correctly noticed that programmers expect an atomic store to actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) i.e. it violates the principle of least surprise.

However, there are cases where it would be very useful, for example avoiding useless shared_ptr ref-count inc/dec in a loop, as sketched below.
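For instance, a sketch of the shared_ptr case (sum is a hypothetical helper): each iteration performs an atomic ref-count increment and decrement that a compiler allowed to coalesce atomic RMWs could hoist out of the loop.

    #include <memory>
    #include <vector>

    int sum(const std::vector<int>& v);  // hypothetical helper

    int total(std::shared_ptr<std::vector<int>> p, int iterations) {
        int t = 0;
        for (int i = 0; i < iterations; ++i) {
            auto copy = p;    // atomic ref-count increment
            t += sum(*copy);
        }                     // atomic ref-count decrement at end of scope
        return t;
    }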

Obviously, any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be a full barrier to runtime and compile-time reordering, even if it no longer touched the memory at num.
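A minimal sketch of that constraint: the two RMWs below cancel numerically, but each is a seq_cst operation, so the pair would still have to order surrounding loads and stores even if the modification of num itself were elided.

    #include <atomic>

    std::atomic<int> num(0);

    void g() {
        num++;  // seq_cst atomic RMW (on x86: lock add, a full barrier)
        num--;  // net effect on num is zero, but the ordering must remain
    }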




Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the working-group discussion / proposal papers wg21/n4455 and wg21/p0062, both discussed further below.

See also discussion of this same topic in Richard Hodges's answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Keeping it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)




Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different.) See also std::memory_order's relationship with volatile on cppreference.

Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped I/O registers, for example).

Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly, and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so that compilers can start doing it in practice.
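A minimal sketch of the workaround:

    #include <atomic>

    volatile std::atomic<int> progress(0);

    void report(int i) {
        // volatile accesses are observable behaviour, so this store cannot
        // be merged with neighbouring stores or hoisted out of a loop.
        progress.store(i, std::memory_order_relaxed);
    }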

I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].

wg21 / p0062 warns that even volatile atomic atoms do not solve all problems, and does not recommend using it for this purpose . This gives this example:

    if(x) {
        foo();
        y.store(0);
    } else {
        bar();
        y.store(0);  // release a lock before a long-running loop
        for() {...}  // loop contains no atomics or volatiles
    }
    // A compiler can merge the stores into a y.store(0) here.

Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store with the same value. (Which would be after the long loop in the else branch.) Especially if the store is only relaxed or release instead of seq_cst.

volatile does stop the coalescing discussed in this question, but this points out that other optimizations on atomic<> can also be problematic for real performance.




Other reasons for not optimizing include: nobody has written the complicated code that would allow the compiler to do these optimizations safely (without ever getting it wrong). That is not sufficient, because N4455 says LLVM already implements, or could easily implement, several of the optimizations it mentions.

The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.

Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).

+34
Aug 30 '17 at 23:50

You are referring to dead-store elimination.

It is not forbidden to eliminate an atomic dead store, but it is harder to prove that an atomic store qualifies as such.

Traditional compiler optimizations, such as dead store elimination, can be performed on atomic operations, even sequentially consistent ones.
Optimizers have to be careful to avoid doing so across synchronization points because another thread of execution can observe or modify memory, which means that the traditional optimizations have to consider more intervening instructions than they usually would when considering optimizations to atomic operations. In the case of dead store elimination it isn't sufficient to prove that an atomic store post-dominates and aliases another to eliminate the other store.

from N4455: No Sane Compiler Would Optimize Atomics

The problem of atomic DSE, in the general case, is that it involves looking for synchronization points; in my understanding, this term means points in the code where there is a happens-before relationship between an instruction on a thread A and an instruction on another thread B.

Consider this code executed by thread A:

    y.store(1, std::memory_order_seq_cst);
    y.store(2, std::memory_order_seq_cst);
    y.store(3, std::memory_order_seq_cst);

Can it be optimized to y.store(3, std::memory_order_seq_cst)?

If a thread B were waiting to see y = 2 (e.g. with a CAS), it would never observe that if the code gets optimized.
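A sketch of such a thread B (the function name is hypothetical):

    #include <atomic>

    extern std::atomic<int> y;  // the variable thread A stores to

    void thread_b() {
        int expected = 2;
        // Spin until we atomically swap the value 2 for 4.  If thread A's
        // stores are coalesced into y.store(3), the value 2 never appears
        // and this loop never terminates.
        while (!y.compare_exchange_weak(expected, 4,
                                        std::memory_order_seq_cst)) {
            expected = 2;  // compare_exchange_weak overwrote it on failure
        }
    }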

However, in my understanding, having B looping and CASsing on y = 2 is a data race, as there is no total order between the two threads' instructions.
An execution in which A's instructions run before B's loop is observable (i.e. allowed), and thus the compiler can optimize to y.store(3, std::memory_order_seq_cst).

If threads A and B were synchronized, somehow, between the stores in thread A, then the optimization would not be allowed (a partial order would be induced, possibly leading to B observing y = 2).

Proving that there is no such synchronization is hard, as it involves considering a broader scope and taking into account all the quirks of an architecture.

As for my understanding: given the relatively young age of atomic operations and the difficulty of reasoning about memory ordering, visibility, and synchronization, compilers don't perform all the possible optimizations on atomics until a more robust framework for detecting and understanding the necessary conditions is built.

I believe your example is a simplification of the counting-thread case given above, as it doesn't have any other thread or any synchronization point; as far as I can see, the compiler could have optimized the three stores.

+41
Aug 30 '17 at 16:09

While you are changing the value of an atomic in one thread, some other thread may be checking it and performing an operation based on the atomic's value. The example you gave is so specific that compiler developers don't see it as worth optimizing. However, if one thread is setting, e.g., consecutive values for an atomic (0, 1, 2, etc.), another thread may be putting something into the slots indicated by the atomic's value.
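A loose sketch of that idea (all names are hypothetical):

    #include <array>
    #include <atomic>

    std::array<int, 3> slot{};
    std::atomic<int> idx(-1);

    void producer() {
        // Every published value matters here: a consumer acts on whichever
        // index it observes, so collapsing these stores would skip slots.
        idx.store(0, std::memory_order_release);
        idx.store(1, std::memory_order_release);
        idx.store(2, std::memory_order_release);
    }

    void consumer() {
        int i = idx.load(std::memory_order_acquire);
        if (i >= 0)
            slot[i] = 42;  // fill the currently-advertised slot
    }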

+9
Aug 30 '17 at 13:25

In short, because the standard (for example the paragraphs around and below 20 in [intro.multithread]) disallows it.

There are happens-before guarantees which must be fulfilled, and which among other things rule out reordering or coalescing writes (paragraph 19 even says so explicitly about reordering).

If your thread writes three values to memory (let's say 1, 2, and 3), one after another, a different thread may read the value. If, for example, your thread is interrupted (or even if it runs concurrently) and another thread also writes to that location, then the observing thread must see the operations in exactly the order they happen (whether by scheduling, by coincidence, or for whatever reason). That's a guarantee.

How is that possible if you only do half of the writes (or even only a single one)? It isn't.

What if your thread instead writes out 1 -1 -1, but another one sporadically writes out 2 or 3? What if a third thread observes the location and waits for a particular value that simply never appears because it has been optimized out?

It is impossible to provide the guarantees that are given if stores (and loads, too) are not performed as requested. All of them, and in the same order.

+5
Aug 30 '17 at 13:30

NB: I was going to comment this, but it's a bit too wordy.

One interesting fact is that this behaviour is not, in terms of C++, a data race.

Note 21 on p.14 is interesting: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf (my emphasis):

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic

Also on p.11, note 5:

"Relaxed" atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races.

So a conflicting action on an atomic is never a data race, in terms of the C++ standard.

These operations are all atomic (and specifically relaxed), but there is no data race here, folks!

I agree there is no reliable/predictable difference between these two on any (sane) platform:

    #include <atomic>

    std::atomic<int> y(0);

    void f() {
        auto order = std::memory_order_relaxed;
        y.store(1, order);
        y.store(1, order);
        y.store(1, order);
    }

and

    #include <atomic>

    std::atomic<int> y(0);

    void f() {
        auto order = std::memory_order_relaxed;
        y.store(1, order);
    }

But within the definition provided by the C++ memory model, it is not a data race.

I can't easily understand why that definition is provided, but it does hand the developer a few cards with which to engage in haphazard communication between threads that they may know (on their platform) will statistically work.

For example, setting a value 3 times and then reading it back will show some degree of contention for that location. Such approaches aren't deterministic, but many effective concurrent algorithms aren't deterministic. For example, a timed-out try_lock_until() is always a race condition, but it remains a useful technique.
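For instance, a sketch of the try_lock_until() pattern (names hypothetical): whether the lock is acquired depends on scheduling, a race condition by design, yet the behaviour is perfectly well-defined.

    #include <chrono>
    #include <mutex>

    std::timed_mutex m;

    bool try_do_work(std::chrono::milliseconds budget) {
        if (!m.try_lock_until(std::chrono::steady_clock::now() + budget))
            return false;  // lost the race this time: also a valid outcome
        std::lock_guard<std::timed_mutex> g(m, std::adopt_lock);
        // ... critical section ...
        return true;
    }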

What it appears the C++ Standard is doing is giving you certainty around "data races", but permitting certain fun-and-games with race conditions which are, on final analysis, different things.

In short, the standard appears to specify that where other threads may see the "hammering" effect of a value being set 3 times, other threads must be able to see that effect (even if they sometimes may not!). And on pretty much all modern platforms it is the case that another thread may, under some circumstances, see the hammering.

+5
Aug 30 '17 at 13:43

A practical use case for the pattern, where the thread does something important between updates that doesn't depend on or modify y, might be: Thread 2 reads the value of y to check how much progress Thread 1 has made.

So, it might be that Thread 1 is supposed to load a configuration file as step 1, put its parsed contents into a data structure as step 2, and display the main window as step 3, while Thread 2 waits on step 2 completing so it can perform another task in parallel that depends on the data structure. (Granted, this example calls for acquire/release semantics, not relaxed ordering.)
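A sketch of that scenario under the stated acquire/release assumption (all function names hypothetical):

    #include <atomic>

    std::atomic<int> phase(0);

    void load_config_file();      // hypothetical step 1
    void build_data_structure();  // hypothetical step 2
    void show_main_window();      // hypothetical step 3
    void use_data_structure();    // hypothetical dependent task

    void thread1() {
        load_config_file();
        phase.store(1, std::memory_order_release);
        build_data_structure();
        phase.store(2, std::memory_order_release);  // publishes step 2's writes
        show_main_window();
        phase.store(3, std::memory_order_release);
    }

    void thread2() {
        while (phase.load(std::memory_order_acquire) < 2) {
            // spin (or sleep) until step 2 is published
        }
        use_data_structure();  // safe: acquire pairs with the release above
    }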

I'm pretty sure a conforming implementation allows Thread 1 never to update y at any intermediate point; while I haven't pored over the standard's language, I would be shocked if it did not support hardware on which another thread polling y might never see the value 2.

Still, this is a hypothetical instance where it could be pessimal to optimize away the status updates. Maybe a compiler developer will come here and say why that compiler chose not to, but one possible reason is letting you shoot yourself in the foot, or at least stub yourself in the toe.

+2
Aug 30 '17 at 21:13

Let's walk a little further away from the pathological case of the three stores being immediately next to each other. Assume there is some non-trivial work being done between the stores, that such work does not involve y at all (so data-path analysis can determine the three stores really are redundant, at least within this thread), and that it introduces no memory barriers (so nothing else forces the stores to be visible to other threads). Now it is quite possible that other threads get an opportunity to do work between the stores, and perhaps those other threads manipulate y, and this thread has some reason to need to reset it to 1 (the 2nd store). If the first two stores were dropped, that would change the behaviour.
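A sketch of the scenario being described, with unrelated_work as a hypothetical stand-in containing no atomics or barriers:

    #include <atomic>

    extern std::atomic<int> y;
    void unrelated_work();  // hypothetical: touches neither y nor any barrier

    void thread_a() {
        y.store(1, std::memory_order_relaxed);
        unrelated_work();   // other threads may run and modify y meanwhile
        y.store(1, std::memory_order_relaxed);  // deliberate reset back to 1
        unrelated_work();
        y.store(1, std::memory_order_relaxed);
    }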

0
Aug 30 '17 at 13:57

A compiler writer cannot just perform the optimization. They must also convince themselves that the optimization is valid in the situations where the compiler writer intends to apply it, that it will not be applied in situations where it is not valid, and that it doesn't break code that is in fact broken but "works" on other implementations. This is probably more work than the optimization itself.

On the other hand, I could imagine that in practice (that is, in programs meant to do a job rather than in benchmarks), this optimization would save very little execution time.

So a compiler writer will look at the cost, then look at the benefit and the risks, and probably will decide against it.

-1
Aug 30 '17 at 22:45

Since the variables contained in a std::atomic object are expected to be accessed from multiple threads, one should expect them to behave, at a minimum, as if they were declared with the volatile keyword.

That was the standard and recommended practice before CPU architectures introduced cache lines, etc.

[EDIT2] One could argue that std::atomic<> are the volatile variables of the multicore age. As defined in C/C++, volatile is only good enough to synchronize atomic reads from a single thread, with an ISR modifying the variable (which in that case is effectively an atomic write as seen from the main thread).

Personally, I am relieved that no compiler would optimize away writes to an atomic variable. If the write is optimized away, how can you guarantee that each of those writes could potentially be seen by readers in other threads? Don't forget that that is also part of the std::atomic<> contract.

Consider this piece of code, where the result would be greatly affected by wild optimization by the compiler.

    #include <atomic>
    #include <thread>

    static const int N{ 1000000 };
    std::atomic<int> flag{1};
    std::atomic<bool> do_run { true };

    void write_1()
    {
        while (do_run.load())
        {
            flag = 1; flag = 1; flag = 1; flag = 1;
            flag = 1; flag = 1; flag = 1; flag = 1;
            flag = 1; flag = 1; flag = 1; flag = 1;
            flag = 1; flag = 1; flag = 1; flag = 1;
        }
    }

    void write_0()
    {
        while (do_run.load())
        {
            flag = -1; flag = -1; flag = -1; flag = -1;
        }
    }

    int main(int argc, char** argv)
    {
        int counter{};
        std::thread t0(&write_0);
        std::thread t1(&write_1);

        for (int i = 0; i < N; ++i)
        {
            counter += flag;
            std::this_thread::yield();
        }

        do_run = false;

        t0.join();
        t1.join();

        return counter;
    }

[EDIT] At first I was not claiming that volatile was central to the implementation of atomics, but...

Since there seemed to be doubts as to whether volatile had anything to do with atomics, I investigated the matter. Here is the atomic implementation from the VS2017 stl. As I surmised, the volatile keyword is everywhere.

    // from file atomic, line 264...

    // TEMPLATE CLASS _Atomic_impl
    template<unsigned _Bytes>
    struct _Atomic_impl
        {   // struct for managing locks around operations on atomic types
        typedef _Uint1_t _My_int;   // "1 byte" means "no alignment required"

        constexpr _Atomic_impl() _NOEXCEPT
            : _My_flag(0)
            {   // default constructor
            }

        bool _Is_lock_free() const volatile
            {   // operations that use locks are not lock-free
            return (false);
            }

        void _Store(void *_Tgt, const void *_Src, memory_order _Order) volatile
            {   // lock and store
            _Atomic_copy(&_My_flag, _Bytes, _Tgt, _Src, _Order);
            }

        void _Load(void *_Tgt, const void *_Src, memory_order _Order) const volatile
            {   // lock and load
            _Atomic_copy(&_My_flag, _Bytes, _Tgt, _Src, _Order);
            }

        void _Exchange(void *_Left, void *_Right, memory_order _Order) volatile
            {   // lock and exchange
            _Atomic_exchange(&_My_flag, _Bytes, _Left, _Right, _Order);
            }

        bool _Compare_exchange_weak(
            void *_Tgt, void *_Exp, const void *_Value,
            memory_order _Order1, memory_order _Order2) volatile
            {   // lock and compare/exchange
            return (_Atomic_compare_exchange_weak(
                &_My_flag, _Bytes, _Tgt, _Exp, _Value, _Order1, _Order2));
            }

        bool _Compare_exchange_strong(
            void *_Tgt, void *_Exp, const void *_Value,
            memory_order _Order1, memory_order _Order2) volatile
            {   // lock and compare/exchange
            return (_Atomic_compare_exchange_strong(
                &_My_flag, _Bytes, _Tgt, _Exp, _Value, _Order1, _Order2));
            }

    private:
        mutable _Atomic_flag_t _My_flag;
        };

So, all of the memory accesses in the MS stl implementation of atomics are done using volatile.

The same holds in the lower-level functions, for example:

    inline int _Atomic_compare_exchange_strong_8(volatile _Uint8_t *_Tgt,
        _Uint8_t *_Exp, _Uint8_t _Value,
        memory_order _Order1, memory_order _Order2)

Note the volatile _Uint8_t* holding the value contained in the std::atomic. This pattern is repeated throughout the MS std::atomic<> implementation. There is no reason for the gcc team, nor any other stl provider, to have done it differently.

-2
Aug 30 '17 at 13:02


