Proven implementation of the Peterson Lock algorithm?

Does anyone know of a good, proper implementation of Peterson's lock algorithm in C? I can't seem to find one. Thanks.

+6
2 answers

I won't make any claims about how good or proper this implementation is, but it has been (briefly) tested. It is a direct translation of the algorithm described on Wikipedia.

    #include <assert.h>

    struct petersonslock_t {
        volatile unsigned flag[2];  /* flag[p]: thread p wants the lock */
        volatile unsigned turn;     /* whose turn it is to wait */
    };
    typedef struct petersonslock_t petersonslock_t;

    petersonslock_t petersonslock () {
        petersonslock_t l = { { 0U, 0U }, ~0U };
        return l;
    }

    void petersonslock_lock (petersonslock_t *l, int p) {
        assert(p == 0 || p == 1);
        l->flag[p] = 1;   /* announce interest */
        l->turn = !p;     /* yield the turn to the other thread */
        while (l->flag[!p] && (l->turn == !p)) {}  /* spin */
    }

    void petersonslock_unlock (petersonslock_t *l, int p) {
        assert(p == 0 || p == 1);
        l->flag[p] = 0;
    }

Greg points out that on an SMP architecture with a slightly relaxed memory consistency (such as x86), although loads to the same memory location are ordered, loads to different locations on one processor may appear out of order to the other processor.
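To illustrate the reordering Greg describes (this example is mine, not from the original answer): with both flags initially zero, each thread's store may still sit in its CPU's store buffer when the other thread's load executes, so both loads can observe zero.

    /* Store-buffering litmus test (illustrative sketch).
     * Initially: flag[0] == flag[1] == 0.
     *
     *   thread 0:            thread 1:
     *     flag[0] = 1;         flag[1] = 1;
     *     r0 = flag[1];        r1 = flag[0];
     *
     * On x86 the outcome r0 == 0 && r1 == 0 is allowed, which is exactly
     * what lets both threads of a naive Peterson lock enter the critical
     * section at once. */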

Jens Gustedt and ninjalj recommend modifying the original algorithm to use the C11 atomic_flag type. Setting the flag and the turn would then use atomic_flag_test_and_set, and clearing them would use atomic_flag_clear. Alternatively, a memory barrier can be placed between the flag updates.
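As a sketch of the memory-barrier alternative (my illustration, untested; it assumes GCC's __sync_synchronize() full barrier and the petersonslock_t type from above):

    void petersonslock_lock (petersonslock_t *l, int p) {
        assert(p == 0 || p == 1);
        l->flag[p] = 1;
        l->turn = !p;
        __sync_synchronize();  /* full barrier: the stores above must be
                                  visible before the loads below execute */
        while (l->flag[!p] && (l->turn == !p)) {}
    }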

Edit: I originally attempted to address this by writing to the same memory location for all of the state. ninjalj pointed out that the bitwise operations turned the state operations into RMWs, rather than the loads and stores of the original algorithm, so atomic bitwise operations are required. C11 provides such operators, as does GCC with its built-ins. The algorithm below uses the GCC built-ins, but wrapped in macros so that they can easily be changed to some other implementation. However, modifying the original algorithm as above is the preferred solution.

    struct petersonslock_t {
        volatile unsigned state;  /* bits 16-23: flag[1], bits 8-15: flag[0],
                                     bits 0-7: turn */
    };
    typedef struct petersonslock_t petersonslock_t;

    #define ATOMIC_OR(x,v)  __sync_or_and_fetch(&x, v)
    #define ATOMIC_AND(x,v) __sync_and_and_fetch(&x, v)

    petersonslock_t petersonslock () {
        petersonslock_t l = { 0x000000U };
        return l;
    }

    void petersonslock_lock (petersonslock_t *l, int p) {
        assert(p == 0 || p == 1);
        unsigned mask = (p == 0) ? 0xFF0000 : 0x00FF00;       /* other's flag */
        ATOMIC_OR(l->state, (p == 0) ? 0x000100 : 0x010000);  /* set my flag  */
        (p == 0) ? ATOMIC_OR(l->state, 0x000001)    /* turn = 1 */
                 : ATOMIC_AND(l->state, 0xFFFF00);  /* turn = 0 */
        while ((l->state & mask) && (l->state & 0x0000FF) == !p) {}
    }

    void petersonslock_unlock (petersonslock_t *l, int p) {
        assert(p == 0 || p == 1);
        ATOMIC_AND(l->state, (p == 0) ? 0xFF00FF : 0x00FFFF);  /* clear my flag */
    }
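For completeness, here is a hypothetical usage sketch (my addition, not part of the answer) exercising the first lock variant with two POSIX threads. Note that, per the caveats above, the barrier or atomic variants should be used on SMP machines:

    #include <pthread.h>
    #include <stdio.h>

    static petersonslock_t lock;
    static unsigned long counter;

    static void *worker (void *arg) {
        int id = *(int *) arg;  /* must be 0 or 1 */
        for (int i = 0; i < 1000000; i++) {
            petersonslock_lock(&lock, id);
            counter++;  /* critical section */
            petersonslock_unlock(&lock, id);
        }
        return NULL;
    }

    int main (void) {
        pthread_t t[2];
        int id[2] = { 0, 1 };
        lock = petersonslock();
        pthread_create(&t[0], NULL, worker, &id[0]);
        pthread_create(&t[1], NULL, worker, &id[1]);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        printf("counter = %lu (expected 2000000)\n", counter);
        return 0;
    }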
+4

Peterson's algorithm cannot be implemented correctly in C99, as explained in who-ordered-memory-fences-on-an-x86.

Peterson's algorithm is as follows:

    LOCK (thread id):               LOCK (thread other):
        interested[id] = 1              interested[other] = 1
        turn = other                    turn = id
        while turn == other             while turn == id
          and interested[other] == 1      and interested[id] == 1

    UNLOCK (thread id):             UNLOCK (thread other):
        interested[id] = 0              interested[other] = 0

There are some hidden assumptions here. To begin with, each thread must signal its interest in acquiring the lock before giving the turn away, and giving the turn away must become visible to the other thread after we have signaled our interest in acquiring the lock.

Also, as in every lock, memory accesses in the critical section cannot be hoisted above lock() nor sunk below unlock(). That is, lock() must have at least acquire semantics, and unlock() must have at least release semantics.

In C11, the easiest way to achieve this would be to use the sequentially consistent memory order, which makes the code run as if it were a simple interleaving of the threads executing in program order (WARNING: totally untested code, but it's similar to an example in Dmitriy Vyukov's Relacy Race Detector):

    #include <stdatomic.h>

    atomic_int interested[2];
    atomic_int turn;

    void lock(int id)
    {
        atomic_store(&interested[id], 1);
        atomic_store(&turn, 1 - id);

        while (atomic_load(&turn) == 1 - id
               && atomic_load(&interested[1 - id]) == 1);
    }

    void unlock(int id)
    {
        atomic_store(&interested[id], 0);
    }

This ensures that the compiler does not perform optimizations that break the algorithm (by hoisting/sinking loads/stores across the atomic operations), and that it emits the appropriate CPU instructions to ensure the CPU doesn't break the algorithm either. The default memory order for C11/C++11 atomic operations that do not explicitly select a memory order is the sequentially consistent memory order.
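To make that default explicit (an illustrative aside, not from the original answer): the two stores below are equivalent, since the plain atomic functions use memory_order_seq_cst.

    atomic_store(&turn, 1);
    atomic_store_explicit(&turn, 1, memory_order_seq_cst);  /* same thing */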

C11/C++11 also support weaker memory orders, allowing as much optimization as possible. The following is a C11 translation of Anthony Williams' C++11 translation of an algorithm originally written by Dmitry Vyukov in the syntax of his own race detector, Relacy Race Detector [petersons_lock_with_C++0x_atomics] [the-inscrutable-c++-memory-model]. If this algorithm is incorrect, the mistake is mine (WARNING: also untested code, but based on good code from Dmitry Vyukov and Anthony Williams):

    void lock(int id)
    {
        atomic_store_explicit(&interested[id], 1, memory_order_relaxed);
        atomic_exchange_explicit(&turn, 1 - id, memory_order_acq_rel);

        while (atomic_load_explicit(&interested[1 - id], memory_order_acquire) == 1
               && atomic_load_explicit(&turn, memory_order_relaxed) == 1 - id);
    }

    void unlock(int id)
    {
        atomic_store_explicit(&interested[id], 0, memory_order_release);
    }

Note the exchange with acquire and release semantics. An exchange is an atomic RMW operation, and atomic RMW operations always read the last value stored before the write in the RMW. Also, an acquire on an atomic object that reads a write from a release on that same atomic object (or any later write on that object from the thread that did the release, or any later write from any atomic RMW operation) creates a synchronizes-with relation between the release and the acquire.

So this operation is a synchronization point between the threads: there is always a synchronizes-with relation between the exchange in one thread and the last exchange performed by any thread (or the initialization of turn, for the very first exchange).

So we have a sequenced-before relation between the store to interested[id] and the exchange to/from turn, a synchronizes-with relation between two consecutive exchanges to/from turn, and a sequenced-before relation between the exchange to/from turn and the load of interested[1 - id]. This amounts to a happens-before relation between accesses to interested[x] in different threads, with turn providing the synchronization between threads. This forces all the ordering needed for the algorithm to work.
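A comment-only sketch of that chain (my illustration; assume thread 0 takes the lock and thread 1 then contends):

    /*  thread 0                                thread 1
     *  --------                                --------
     *  store interested[0] (relaxed)
     *     | sequenced-before
     *  exchange turn (acq_rel)  ------------>  exchange turn (acq_rel)
     *             synchronizes-with               | sequenced-before
     *                                          load interested[0] (acquire)
     *
     *  Result: the store to interested[0] happens-before the load of
     *  interested[0] in thread 1, which is the ordering the algorithm needs.
     */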

So how was this done before C11? It involved using compiler- and CPU-specific magic. As an example, let's look at the fairly strongly ordered x86. IIRC, all x86 loads have acquire semantics and all stores have release semantics (save for non-temporal moves, in SSE, used precisely to achieve higher performance at the cost of occasionally requiring CPU fences to achieve coherence between CPUs). But this is not enough for Peterson's algorithm: as Bartosz Milewski explains in who-ordered-memory-fences-on-an-x86, for Peterson's algorithm to work we need to establish an ordering between the accesses to turn and interested; failing to do that may result in seeing loads from interested[1 - id] before the writes to interested[id], which is a bad thing.

So the way to do this in GCC/x86 would be (WARNING: although I tested something similar to the following, it was actually a modified version of the code in wrong-implementation-of-petersons-algorithm, and testing never comes close to proving multithreaded code correct):

    /* pre-C11 version: plain globals, ordering enforced manually */
    int interested[2];
    int turn;

    void lock(int id)
    {
        interested[id] = 1;
        turn = 1 - id;

        __asm__ __volatile__("mfence");  /* order the stores above before
                                            the loads in the loop below */
        do {
            __asm__ __volatile__("":::"memory");  /* compiler barrier */
        } while (turn == 1 - id
                 && interested[1 - id] == 1);
    }

    void unlock(int id)
    {
        interested[id] = 0;
    }

MFENCE prevents stores and loads to different memory addresses from being reordered. Otherwise the write to interested[id] could be queued in the store buffer while interested[1 - id] is loaded. On many current microarchitectures an SFENCE might be enough, since it can be implemented as a store-buffer drain, but IIUC SFENCE is not required to be implemented that way and may simply prevent reordering between stores. So SFENCE may not be enough everywhere, and we need a full MFENCE.

The compiler barrier (__asm__ __volatile__("":::"memory")) prevents the compiler from deciding that it already knows the value of turn. We're telling the compiler that we have clobbered memory, so all values cached in registers must be reloaded from memory.
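As a hypothetical illustration (mine, not actual compiler output) of what could go wrong without that clobber: the compiler would be free to cache the shared variables in registers and never re-read them, effectively turning the spin loop into this:

    void lock_miscompiled(int id)        /* hypothetical result, for illustration */
    {
        interested[id] = 1;
        turn = 1 - id;
        __asm__ __volatile__("mfence");

        int t = turn;                    /* loaded once... */
        int i = interested[1 - id];      /* ...and never reloaded */
        while (t == 1 - id && i == 1) {} /* may spin forever */
    }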

PS: I feel this answer deserves a concluding paragraph, but my brain is drained.

+5
