Not good enough? Should I, for example, use the register keyword in C?
register is a meaningless hint to modern optimizing compilers.
I think a simple spinlock that doesn't have any serious / obvious performance problems on x86 looks something like this. Of course, a real implementation would use a system call (e.g. Linux futex ) after spinning for a while, and unlocking would have to check whether it needs to notify any waiters with another system call. This is important: you don't want to spin forever, wasting CPU time (and energy / heat) without accomplishing anything. But conceptually, this is the spin part of a spinlock before you take the fallback path. It's an important piece of how lightweight locking gets implemented. (Only attempting the lock once before calling the kernel would also be a valid choice, instead of spinning at all.)
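The spin-then-fallback structure might look like this in C11 stdatomic. This is only a sketch under assumptions: `SPIN_LIMIT` is an arbitrary tuning knob I invented, and `sched_yield` stands in for a real futex-wait (a real Linux implementation would use the futex(2) syscall and have the unlocker futex-wake a sleeper):

```c
#include <stdatomic.h>
#include <sched.h>       /* sched_yield: placeholder for a real futex-wait */

#define SPIN_LIMIT 1000  /* arbitrary: how long to spin before sleeping */

void lock_with_fallback(atomic_char *l) {
    for (;;) {
        if (!atomic_exchange_explicit(l, 1, memory_order_acquire))
            return;                             /* got the lock */
        for (int i = 0; i < SPIN_LIMIT; i++) {  /* spin phase: read-only */
            if (!atomic_load_explicit(l, memory_order_relaxed))
                break;                          /* looks free: retry the xchg */
        }
        if (atomic_load_explicit(l, memory_order_relaxed))
            sched_yield();  /* still held: a real impl would futex-wait here */
    }
}

void unlock_with_fallback(atomic_char *l) {
    atomic_store_explicit(l, 0, memory_order_release);
    /* a real implementation would futex-wake a waiter here, if any */
}
```

The key shape is that the expensive atomic RMW runs once per retry round, while the bounded inner loop only reads.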
Implement as much of this as you like in inline asm, or preferably using C11 stdatomic , like this semaphore implementation.
```nasm
;;; UNTESTED ;;;
;;; TODO: **IMPORTANT** fall back to OS-supported sleep/wakeup after spinning some
    ; first arg in rdi, per the AMD64 SysV ABI

;;;;; void spin_unlock(volatile char *lock)
global spin_unlock
spin_unlock:
    ;; debug: check that the old value was non-zero.  double-unlocking is a nasty bug
    mov   byte [rdi], 0
    ret
;; The store has release semantics, but not sequential-consistency (which you'd get from an xchg or something),
;; because acquire/release is enough to protect a critical section (hence the name)

;;;;; void spin_lock(volatile char *lock)
global spin_lock
spin_lock:
    cmp   byte [rdi], 0   ; avoid writing to the cache line if we don't own the lock: should speed up the other thread unlocking
    jnz   .spinloop
    mov   al, 1           ; only need to do this the first time, otherwise we know al is non-zero
.retry:
    xchg  al, [rdi]

    test  al, al          ; check if we actually got the lock
    jnz   .spinloop
    ret                   ; no taken branches on the fast-path

.spinloop:
    pause                 ; very old CPUs decode it as REP NOP, which is fine
    cmp   byte [rdi], 0   ; To get a compiler to do this in C++11, use a memory_order_acquire load
    jnz   .spinloop
    jmp   .retry
```
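The asm above maps fairly directly onto C11 stdatomic. A rough sketch (function names mirror the asm labels; this is my translation, not verified output):

```c
#include <stdatomic.h>

void spin_lock(atomic_char *lock) {
    for (;;) {
        /* read-only check first, like the cmp/jnz before the xchg */
        while (atomic_load_explicit(lock, memory_order_acquire))
            ;  /* spin (the pause instruction would go here on x86) */
        /* the atomic RMW: on x86 this compiles to xchg */
        if (!atomic_exchange_explicit(lock, 1, memory_order_acquire))
            return;
    }
}

void spin_unlock(atomic_char *lock) {
    /* plain release store, like  mov byte [rdi], 0  in the asm */
    atomic_store_explicit(lock, 0, memory_order_release);
}
```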
If you were using a bit-field of lock flags, you could use lock bts (test-and-set) for the equivalent of xchg-with-1. You can spin on bt or test . To unlock, you'd need lock btr , not just btr , because the unadorned btr would be a non-atomic read-modify-write of the byte, or even of the containing 32 bits.
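In C11, a bit-per-lock scheme can be expressed with atomic fetch-or / fetch-and, which the compiler can implement with lock ed RMW instructions ( lock bts / lock btr , or lock or / lock and ). A sketch, with names of my own invention:

```c
#include <stdatomic.h>

/* Hypothetical: 32 lock flags packed into one word; bit n is lock n. */
atomic_uint lock_flags;

int bit_trylock(unsigned n) {
    /* atomically set bit n; the old value tells us whether it was free */
    unsigned old = atomic_fetch_or_explicit(&lock_flags, 1u << n,
                                            memory_order_acquire);
    return !(old & (1u << n));   /* 1 if we set the bit, i.e. got the lock */
}

void bit_unlock(unsigned n) {
    /* must be an atomic RMW: a plain read-modify-write of the word
       could stomp on neighbouring lock bits being taken concurrently */
    atomic_fetch_and_explicit(&lock_flags, ~(1u << n), memory_order_release);
}
```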
With a byte- or word-sized lock, you don't even need a lock ed operation to unlock; release semantics are enough, and a simple store has them. glibc's pthread_spin_unlock does the same thing as my unlock function: a simple store.
The lock function avoids writing to the lock if we see it's already locked. This avoids invalidating the cache line in the L1 of the core the lock-owning thread is running on, so that line can flip back to "Modified" ( MESIF or MOESI ) with less cache-coherence delay when it unlocks.
We also don't flood the CPU with lock ed operations in a loop. I'm not sure how much this slows things down in general, but 10 threads all waiting for the same spinlock will keep the memory arbitration hardware pretty busy. This could slow down the thread that holds the lock, or other unrelated threads on the system, while they use other locks or memory in general.
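A standard way to reduce that pressure further (not shown in the asm above) is exponential backoff: after each failed RMW, spin read-only for a growing, capped number of iterations before retrying. A hedged sketch, with an arbitrary cap I chose for illustration:

```c
#include <stdatomic.h>

/* Sketch: double the read-only spin length after each failed RMW,
 * so contending waiters issue fewer locked operations on the line. */
void backoff_lock(atomic_char *lock) {
    unsigned backoff = 4;                       /* arbitrary starting point */
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire)) {
        for (unsigned i = 0;
             i < backoff && atomic_load_explicit(lock, memory_order_relaxed);
             i++)
            ;                                   /* read-only wait, no RMW */
        if (backoff < (1u << 16))
            backoff *= 2;                       /* exponential, with a cap */
    }
}
```

The cap matters: unbounded backoff makes waiters slow to notice an unlock, trading throughput for latency.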
pause is also essential to avoid memory-order mis-speculation by the CPU when leaving the spin loop: you only exit the loop because the memory you're reading was changed by another core. However, we don't want to pause in the uncontended case. On Skylake, pause waits a lot longer, ~100 cycles IIRC, so you should definitely keep the spinloop separate from the initial check for unlocked.
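In C, the usual way to get pause is the _mm_pause intrinsic. A sketch of a lock that keeps the uncontended fast path pause -free (the cpu_relax name is mine; assumes GCC/Clang, with a no-op fallback on non-x86):

```c
#include <stdatomic.h>
#if defined(__x86_64__) || defined(__i386__)
#  include <immintrin.h>
#  define cpu_relax() _mm_pause()   /* emits the pause instruction */
#else
#  define cpu_relax() ((void)0)     /* no-op stand-in on other ISAs */
#endif

void pause_lock(atomic_char *lock) {
    /* fast path: the uncontended case never executes pause */
    if (!atomic_exchange_explicit(lock, 1, memory_order_acquire))
        return;
    do {
        /* contended: pause lives only inside the read-only spin loop */
        while (atomic_load_explicit(lock, memory_order_relaxed))
            cpu_relax();
    } while (atomic_exchange_explicit(lock, 1, memory_order_acquire));
}
```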
I'm sure Intel's and AMD's optimization manuals talk about this; see the x86 tag wiki for that and many other links.