Does atomic_thread_fence(memory_order_seq_cst) have the semantics of a full memory barrier?

A full (general) memory barrier is one that guarantees that all LOAD and STORE operations issued before the barrier become visible to other system components before all LOAD and STORE operations issued after the barrier.

According to cppreference, memory_order_seq_cst is equivalent to memory_order_acq_rel plus a single total modification order for all operations so tagged. But, as far as I know, neither an acquire fence nor a release fence in C++11 provides #StoreLoad ordering (a load following a store). A release fence requires that no preceding read/write be reordered with any subsequent write; an acquire fence requires that no subsequent read/write be reordered with any preceding read. Please correct me if I am wrong ;)

With an example:

 atomic<int> x;
 atomic<int> y;

 y.store(1, memory_order_relaxed);          // (1)
 atomic_thread_fence(memory_order_seq_cst); // (2)
 x.load(memory_order_relaxed);              // (3)

Is an optimizing compiler allowed to reorder instruction (3) before (1), so that it looks like this:

 x.load(memory_order_relaxed);              // (3)
 y.store(1, memory_order_relaxed);          // (1)
 atomic_thread_fence(memory_order_seq_cst); // (2)

If this is a valid transformation, then it proves that atomic_thread_fence(memory_order_seq_cst) does not necessarily have the semantics of a full barrier.

+6
2 answers

atomic_thread_fence(memory_order_seq_cst) always generates a full barrier:

  • x86_64: MFENCE
  • PowerPC: hwsync
  • Itanium: mf
  • ARMv7 / ARMv8: dmb ish
  • MIPS64: sync

The main thing: the observing thread can simply observe the operations in a different order, and it does not matter which barriers you use in the observed thread.

Is an optimizing compiler allowed to reorder instruction (3) before (1)?

No, it is forbidden. But in a globally visible multithreaded program this holds only if:

  • other threads use the same memory_order_seq_cst for the atomic read/write operations on these variables
  • or other threads likewise use atomic_thread_fence(memory_order_seq_cst) between their load() and store(); but this approach does not guarantee sequential consistency in general, since sequential consistency is a stronger guarantee.

Working Draft, Standard for Programming Language C++, 2016-07-12: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf

§ 29.3 Order and consistency

§ 29.3 / 8

[ Note: memory_order_seq_cst ensures sequential consistency only for a program that is free of data races and uses exclusively memory_order_seq_cst operations. Any use of weaker ordering will invalidate this guarantee unless extreme care is used. In particular, memory_order_seq_cst fences ensure a total order only for the fences themselves. Fences cannot, in general, be used to restore sequential consistency for atomic operations with weaker ordering specifications. — end note ]


How this maps to assembly:

Case-1:

 atomic<int> x, y;

 y.store(1, memory_order_relaxed);          // (1)
 atomic_thread_fence(memory_order_seq_cst); // (2)
 x.load(memory_order_relaxed);              // (3)

This code is not always semantically equivalent to Case-2, but it generates the same instructions between the STORE and the LOAD; and if both the LOAD and the STORE use memory_order_seq_cst, that is Sequential Consistency, which prevents StoreLoad reordering. Case-2:

 atomic<int> x, y;

 y.store(1, memory_order_seq_cst); // (1)
 x.load(memory_order_seq_cst);     // (3)

With some notes:

  • it can add redundant instructions (as in the MIPS64 example below)
  • or it can use similar operations in the form of other instructions:

Manual for ARMv8-A

Table 13.1. Barrier parameters

ISH Any - Any

"Any - Any" means that both loads and stores must complete before the barrier. Both loads and stores that appear after the barrier in program order must wait for the barrier to complete.

Preventing the reordering of two instructions can be achieved with additional instructions between them. And as we can see, a first STORE(seq_cst) and a following LOAD(seq_cst) generate the same instructions between them as FENCE(seq_cst) (atomic_thread_fence(memory_order_seq_cst)).

Mapping of C/C++11 memory_order_seq_cst to different CPU architectures for load(), store(), atomic_thread_fence():

Note that atomic_thread_fence(memory_order_seq_cst) always generates a full barrier:

  • x86_64: STORE - MOV (into memory); MFENCE , LOAD - MOV (from memory) , fence - MFENCE

  • x86_64-alt: STORE - MOV (into memory) , LOAD - MFENCE; MOV (from memory) , fence - MFENCE

  • x86_64-alt3: STORE - (LOCK) XCHG , LOAD - MOV (from memory) , fence - MFENCE - full barrier

  • x86_64-alt4: STORE - MOV (into memory) , LOAD - LOCK XADD(0) , fence - MFENCE - full barrier

  • PowerPC: STORE - hwsync; st , LOAD - hwsync; ld; cmp; bc; isync , fence - hwsync

  • Itanium: STORE - st.rel; mf , LOAD - ld.acq , fence - mf

  • ARMv7: STORE - dmb ish; str; dmb ish , LOAD - ldr; dmb ish , fence - dmb ish

  • ARMv7-alt: STORE - dmb ish; str , LOAD - dmb ish; ldr; dmb ish , fence - dmb ish

  • ARMv8 (AArch32): STORE - STL , LOAD - LDA , fence - dmb ish - full barrier

  • ARMv8 (AArch64): STORE - STLR , LOAD - LDAR , fence - dmb ish - full barrier

  • MIPS64: STORE - sync; sw; sync , LOAD - sync; lw; sync , fence - sync

The complete mapping of C/C++11 semantics to the different processor architectures for load(), store(), and atomic_thread_fence() is described at: <a13>

Since Sequential Consistency prevents StoreLoad reordering, and since sequentially consistent operations (store(memory_order_seq_cst) followed by load(memory_order_seq_cst)) generate the same instructions between them as atomic_thread_fence(memory_order_seq_cst), it follows that atomic_thread_fence(memory_order_seq_cst) also prevents StoreLoad reordering.

+1

C++ fences are not direct equivalents of CPU barrier instructions, although they may well be implemented as such. C++ fences are part of the C++ memory model, which deals with visibility and ordering constraints.

Given that processors usually reorder reads and writes and cache values locally before making them available to other cores or processors, the order in which effects become visible to other processors is generally not predictable.

When reasoning about these semantics, it is important to think about what exactly you are trying to prevent.

Suppose the code maps to machine instructions as written, (1) then (2) then (3), and these instructions ensure that (1) is globally visible before (3) executes.

The whole purpose of this fragment is to communicate with another thread. But you cannot guarantee that the other thread is running on any processor while this fragment runs on ours. Therefore, the entire fragment can run without interruption, and (3) may still read whatever value x held when (1) executed. In that case the result is indistinguishable from the execution order (3), (1), (2).

So: yes, this is an allowed optimization, because you cannot tell the difference.

0

Source: https://habr.com/ru/post/974284/
