Locks around memory manipulation via inline assembly

I am new to low-level programming, so I don't yet have a feel for the problems you run into there, and I'm not even sure I understand the term "atomic" correctly. Right now I'm trying to build simple atomic locks around memory manipulation using extended inline assembly. Why? Out of curiosity. I know I am reinventing the wheel here and possibly oversimplifying the whole process.

Question: does the code I present here achieve its goal of making memory manipulation thread-safe as well as reentrant?

  • If it works, why?
  • If it does not work, why not?
  • Not good enough? Should I, for example, use the register keyword in C?

What I simply want to do:

  • Before manipulating the memory, lock.
  • After manipulating the memory, unlock.

The code:

    volatile int atomic_gate_memory = 0;

    static inline void atomic_open(volatile int *gate)
    {
        asm volatile (
            "wait:\n"
            "cmp %[lock], %[gate]\n"
            "je wait\n"
            "mov %[lock], %[gate]\n"
            : [gate] "=m" (*gate)
            : [lock] "r" (1)
        );
    }

    static inline void atomic_close(volatile int *gate)
    {
        asm volatile (
            "mov %[lock], %[gate]\n"
            : [gate] "=m" (*gate)
            : [lock] "r" (0)
        );
    }

Then something like:

    void *_malloc(size_t size)
    {
        atomic_open(&atomic_gate_memory);
        void *mem = malloc(size);
        atomic_close(&atomic_gate_memory);
        return mem;
    }
    #define malloc(size) _malloc(size)

... and the same for calloc, realloc, free and fork (on Linux).

    #ifdef _UNISTD_H
    int _fork()
    {
        pid_t pid;
        atomic_open(&atomic_gate_memory);
        pid = fork();
        atomic_close(&atomic_gate_memory);
        return pid;
    }
    #define fork() _fork()
    #endif

After setting up the stack frame for atomic_open, objdump shows:

    00000000004009a7 <wait>:
      4009a7:   39 10    cmp    %edx,(%rax)
      4009a9:   74 fc    je     4009a7 <wait>
      4009ab:   89 10    mov    %edx,(%rax)

Also, given the disassembly above, can I assume I am performing an atomic operation because it is only one instruction?

+2
c assembly memory-management memory locking
May 15 '16 at 17:37
1 answer

Not good enough? Should I, for example, use the register keyword in C?

register is a meaningless hint to modern optimizing compilers.




I think a simple spinlock that doesn't have any really major / obvious performance problems on x86 is something like this. Of course, a real implementation would use a system call (like Linux futex) after spinning for a while, and unlocking would have to check whether it needs to notify any waiters with another system call. This is important; you don't want to spin forever, wasting CPU time (and energy / heat) doing nothing. But conceptually, this is the spin part of a spinlock before you take the fallback path. It's an important piece of how lightweight locking is implemented. (Attempting to take the lock only once before calling the kernel would also be a valid choice, instead of spinning at all.)

Implement as much of this as you like in inline asm, or preferably using C11 stdatomic, like this semaphore implementation.

    ;;; UNTESTED ;;;;;;;;
    ;;; TODO: **IMPORTANT** fall back to OS-supported sleep/wakeup after spinning some
    ; first arg in rdi, in the AMD64 SysV ABI

    ;;;;; void spin_unlock(volatile char *lock)
    global spin_unlock
    spin_unlock:
        ;; debug: check that the old value was non-zero. double-unlocking is a nasty bug
        mov   byte [rdi], 0
        ;; The store has release semantics, but not sequential consistency (which you'd get from an xchg or something),
        ;; because acquire/release is enough to protect a critical section (hence the name)
        ret

    ;;;;; void spin_lock(volatile char *lock)
    global spin_lock
    spin_lock:
        cmp   byte [rdi], 0   ; avoid writing to the cache line if we don't own the lock: should speed up the other thread unlocking
        jnz   .spinloop

        mov   al, 1           ; only need to do this the first time, otherwise we know al is non-zero
    .retry:
        xchg  al, [rdi]

        test  al, al          ; check if we actually got the lock
        jnz   .spinloop
        ret                   ; no taken branches on the fast path

    .spinloop:
        pause                 ; very old CPUs decode it as REP NOP, which is fine
        cmp   byte [rdi], 0   ; to get a compiler to do this in C++11, use a memory_order_acquire load
        jnz   .spinloop
        jmp   .retry
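
A minimal sketch of roughly the same algorithm in C11 stdatomic, for comparison (untested; the spinlock_t name is just illustrative, not from any library):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool locked; } spinlock_t;   /* initialise with { false } */

    static inline void spin_lock(spinlock_t *l)
    {
        /* acquire RMW; on x86 this typically compiles to xchg */
        while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire)) {
            /* contended: spin read-only until the lock looks free, then retry the
               exchange, so waiters don't hammer the line with locked RMWs */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;   /* a pause hint belongs here; see the note about PAUSE below */
        }
    }

    static inline void spin_unlock(spinlock_t *l)
    {
        /* a plain release store is enough to unlock, like the asm version */
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }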



If you were using a bitfield of atomic flags, you could use lock bts (test-and-set) as the equivalent of xchg-with-1. You can spin on bt or test. To unlock, you would need lock btr, not just btr, because a plain btr would be a non-atomic read-modify-write of the byte, or even of the containing 32 bits.
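
As a sketch of this bitfield-of-flags idea, assuming GCC/Clang __atomic builtins (untested; the bit_lock / bit_unlock names are made up): the compiler can sometimes turn the fetch_or / fetch_and into lock bts / lock btr when only the affected bit of the result is tested, but that is an optimization, not a guarantee.

    #include <stdint.h>

    /* lock bit number `bit` inside a shared 32-bit word of flags */
    static inline void bit_lock(uint32_t *flags, unsigned bit)
    {
        uint32_t mask = 1u << bit;
        /* equivalent of xchg-with-1 for one bit: atomically set it and test the old value */
        while (__atomic_fetch_or(flags, mask, __ATOMIC_ACQUIRE) & mask) {
            /* read-only spin (like bt / test) while another thread holds this bit */
            while (__atomic_load_n(flags, __ATOMIC_RELAXED) & mask)
                ;
        }
    }

    static inline void bit_unlock(uint32_t *flags, unsigned bit)
    {
        /* must be an atomic RMW (lock btr style), not a plain and/btr: other bits of
           the same word may be modified concurrently by other threads */
        __atomic_fetch_and(flags, ~(1u << bit), __ATOMIC_RELEASE);
    }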

With a lock that is a whole byte (or int), you don't even need a lock-ed operation to unlock; release semantics are enough. glibc's pthread_spin_unlock does the same thing as my unlock function: a simple store.




The lock function avoids writing to the lock if we see that it is already locked. This avoids invalidating the cache line in the L1 of the core running the thread that owns it, so that line can go back to "Modified" (MESIF or MOESI) with less cache-coherence delay during the unlock.

We also don't flood the CPU with lock-ed operations in a loop. I'm not sure how much that slows things down in general, but 10 threads all waiting for the same spinlock will keep the memory arbitration hardware pretty busy. This could slow down the thread that actually holds the lock, or other unrelated threads in the system, while they use other locks or memory in general.

PAUSE is also essential to avoid mis-speculation about memory ordering by the CPU: you only exit the loop when the memory you are reading has been modified by another core. However, we don't want to PAUSE in the un-contended case. On Skylake, PAUSE waits a lot longer, something like ~100 cycles IIRC, so you should definitely keep the spinloop separate from the initial check for unlocked.
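
From C, the pause hint is available as the _mm_pause() intrinsic (GCC, Clang and MSVC all provide it). A sketch of the contended wait loop from the C11 version above might then look like this (untested):

    #include <immintrin.h>   /* _mm_pause */
    #include <stdatomic.h>

    static inline void spin_wait_until_free(atomic_bool *locked)
    {
        while (atomic_load_explicit(locked, memory_order_relaxed))
            _mm_pause();   /* decodes as rep nop on old CPUs; saves power and avoids the
                              memory-order mis-speculation penalty when the loop exits */
    }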

I'm sure Intel's and AMD's optimization manuals talk about this; see the x86 tag wiki for that and plenty of other links.

+5
May 16 '16 at 3:25


