C and low-level semaphore implementation

I was thinking about how to implement a (counting, not binary) semaphore using as little asm code as possible.
I wasn't able to write this without using a mutex, so here is the best I could do so far:

Global:

    #include <stdlib.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_ullong value;
        pthread_mutex_t *lock_op;
        bool ready;
    } semaphore_t;

    typedef struct {
        atomic_ullong value;
        pthread_mutex_t lock_op;
        bool ready;
    } static_semaphore_t;

    /* use with static_semaphore_t */
    #define SEMAPHORE_INITIALIZER(value) = {value, PTHREAD_MUTEX_INITIALIZER, true}


Functions:

    bool semaphore_init(semaphore_t *semaphore, unsigned long long init_value) {
        if(semaphore->ready)
            if(!(semaphore->lock_op = calloc(1, sizeof(pthread_mutex_t))))
                return false;
            else
                pthread_mutex_destroy(semaphore->lock_op);

        if(pthread_mutex_init(semaphore->lock_op, NULL))
            return false;

        semaphore->value = init_value;
        semaphore->ready = true;
        return true;
    }

    bool semaphore_wait(semaphore_t *semaphore) {
        if(!semaphore->ready)
            return false;

        pthread_mutex_lock(&(semaphore->lock_op));
        while(!semaphore->value)
            __asm__ __volatile__("nop");
        (semaphore->value)--;
        pthread_mutex_unlock(&(semaphore->lock_op));
        return true;
    }

    bool semaphore_post(semaphore_t *semaphore) {
        if(!semaphore->ready)
            return false;

        atomic_fetch_add(&(semaphore->value), (unsigned long long) 1);
        return true;
    }


Is it possible to implement a semaphore in only a few lines, with atomic built-ins or directly in assembly (for example, lock cmpxchg)?

Looking at the sem_t structure from <bits/semaphore.h>, included by <semaphore.h>, it seems to me that it was designed in a completely different way ...

    typedef union {
        char __size[__SIZEOF_SEM_T];
        long int __align;
    } sem_t;






UPDATE:

@PeterCordes proposed a much better solution using atomics, without a mutex, operating directly on the semaphore's value.

I would still like to understand better how to improve the code's performance, taking into account the benefits of the pause instruction, or of system calls that avoid burning CPU while waiting for a critical resource to become available.

It would also be nice to have a standard implementation of mutexes and counting semaphores for comparison.
From futex(7) I read: "The Linux kernel provides futexes ('Fast user-space mutexes') as a building block for fast user-space locking and semaphores. Futexes are very basic and lend themselves well for building higher-level locking abstractions such as mutexes, condition variables, read-write locks, barriers, and semaphores."

c assembly multithreading mutex semaphore
Mar 18 '16 at 20:59
1 answer

See partway down for my minimal implementation of a naive semaphore, which probably works. It compiles and looks correct for x86. I think it is correct in general for any C11 implementation.




IIRC, it is possible to implement a counting lock (aka semaphore) with just a single integer that you access with atomic operations. That wikipedia link even gives algorithms for up/down. You do not need a separate mutex. If atomic_ullong requires a mutex to support atomic increment/decrement on the target CPU, the implementation will include one. (That may be the case on 32-bit x86, or the implementation may use a slow cmpxchg8b instead of a fast lock xadd. Is a 32-bit counter really too small for your semaphore? 64-bit atomics will be slower on 32-bit machines.)

The definition in <bits/semaphore.h> is clearly just an opaque POD type of the right size, not indicative of the actual implementation.




As @DavidSchwartz says, it's a fool's errand to implement your own locking for practical use unless you're an expert. It can be an interesting way to learn about atomic operations and to find out what's under the hood in standard implementations, though. Note his caution that locking implementations are hard to test. You can write code that works for your test case, on your hardware, with code from the current version of your compiler, with your chosen compile options ...




The ready boolean is just a waste of space. If you can correctly initialize ready so that it's meaningful for functions to look at it, you could equally have initialized the other fields to a sane initial state.

There are several other problems I noticed in your code:

    #define SEMAPHORE_INITIALIZER(value) = {value, PTHREAD_MUTEX_INITIALIZER, true};

    static_semaphore_t my_lock = SEMAPHORE_INITIALIZER(1);
    // expands to: my_lock = = {1, PTHREAD_MUTEX_INITIALIZER, true};;
    // leave out the = and ; in the macro definition so it works like a value



Using a dynamically allocated pthread_mutex_t *lock_op is just silly. Use a value, not a pointer. Every function that uses the mutex has to chase that pointer first, so the extra level of indirection only slows things down. It would be much better for the mutex's memory to be near the counter. A mutex does not take much space.
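For illustration, here is a sketch of the struct with the mutex stored by value (the _v2 names are mine): no allocation, no failure path for calloc, and the mutex sits right next to the counter.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_ullong   value;
    pthread_mutex_t lock_op;   // by value: no calloc, no pointer chase, adjacent to the counter
} semaphore_v2_t;

static bool semaphore_v2_init(semaphore_v2_t *s, unsigned long long init_value) {
    if (pthread_mutex_init(&s->lock_op, NULL))   // nonzero return means failure
        return false;
    atomic_init(&s->value, init_value);
    return true;
}
```

With this layout the ready flag becomes unnecessary too: a static instance can use PTHREAD_MUTEX_INITIALIZER directly, as the question's static_semaphore_t already does.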




    while(!semaphore->value)
        __asm__ __volatile__("nop");

You don't want this loop to waste power and slow down other threads, or even the other logical thread sharing the same core via hyperthreading.

nop does not make a busy-wait loop less CPU-intensive. Even with hyperthreading it probably makes no difference on x86, because the whole loop body likely fits in 4 uops and so issues at one iteration per clock whether the nop is there or not. nop doesn't need an execution unit, so at least it doesn't hurt. Doing this spin loop while holding a mutex seems silly: only the first waiter gets to this spin loop, and later waiters spin on the mutex itself.




Here is my naive semaphore implementation using only C11 atomic operations

I think this is a good implementation that achieves its very limited goals: being correct and small (in source code and machine code), without using any other actual locking primitives. There are major areas I'm not even trying to address (e.g. fairness/starvation, yielding the CPU to other threads, probably other things).

See the asm output on godbolt: only 12 x86 insns for down, 2 for up (including the rets). Godbolt's non-x86 compilers (gcc 4.8 for ARM/ARM64/PPC) are too old to support C11 <stdatomic.h>. (They do have C++ std::atomic, though.) So unfortunately I can't easily check the asm output for non-x86.

    #include <stdatomic.h>

    #define likely(x)    __builtin_expect(!!(x), 1)
    #define unlikely(x)  __builtin_expect(!!(x), 0)

    typedef struct {
        atomic_int val;   // int is plenty big.  Must be signed, so an extra decrement doesn't make 0 wrap to >= 1
    } naive_sem_t;

    #if defined(__i386__) || defined(__x86_64__)
    #include <immintrin.h>
    static inline void spinloop_body(void) { _mm_pause(); }   // PAUSE is "rep nop" in asm output
    #else
    static inline void spinloop_body(void) { }
    #endif

    void sem_down(naive_sem_t *sem)
    {
        while (1) {
            while (likely(atomic_load_explicit(&(sem->val), memory_order_acquire) < 1))
                spinloop_body();   // wait for the semaphore to be available

            // try to take the lock.  Might only need mo_acquire
            int tmp = atomic_fetch_add_explicit(&(sem->val), -1, memory_order_acq_rel);
            if (likely(tmp >= 1))
                break;   // we successfully got the lock
            else   // undo our attempt; another thread's decrement happened first
                atomic_fetch_add_explicit(&(sem->val), 1, memory_order_release);
                // could be "relaxed", but we're still busy-waiting and want other threads to see this ASAP
        }
    }

    // note the release, not seq_cst.  Use a stronger ordering parameter if you want it to be a full barrier.
    void sem_up(naive_sem_t *sem)
    {
        atomic_fetch_add_explicit(&(sem->val), 1, memory_order_release);
    }

The trick here is that it's OK for val to temporarily be too low; that just makes other threads spin. Also note that fetch_add being a single atomic operation is key. It returns the old value, so we can detect when val was taken by another thread between the while loop's load and the fetch_add. (Note that we don't need to check that tmp is == to the while loop's load: it's fine if another thread upped the semaphore between the load and the fetch_add. This is a benefit of using fetch_add instead of cmpxchg.)

The atomic_load spin loop is just a performance optimization, so that not every waiter hammers val with atomic read-modify-writes. (Although with many waiters all doing the decrement-then-undo dance, a waiter actually seeing the lock unlocked could be quite rare.)

A real implementation would have special-case code for more platforms than just x86, and for x86 probably more than just the PAUSE instruction inside the spin loop. This is still just a toy example of a fully portable C11 implementation. PAUSE apparently helps avoid memory-ordering mis-speculation, so the CPU runs more efficiently after leaving the spin loop. PAUSE is not the same as yielding the logical CPU to the OS to run another thread. It also has nothing to do with correctness or with the choice of memory_order_??? parameters.

A real implementation would probably also yield the CPU to the OS after some number of spin iterations (with sched_yield(2), or more likely a futex system call, see below). Maybe using x86 MONITOR / MWAIT would be even friendlier to hyperthreading; I'm not sure. I've never actually implemented locking myself, I just see all this in the x86 insn reference while looking up other insns.




As mentioned earlier, x86's lock xadd implements fetch_add (with sequential-consistency semantics, since locked instructions are always a full memory barrier). On non-x86, using only acquire + release semantics for the fetch_add, rather than full sequential consistency, might allow more efficient code. I'm not sure, but using only acquire would very likely allow more efficient ARM64 code. I think we only need acquire on the fetch_add, not acq_rel, but I'm not sure. On x86 there will be no difference in the code, since locked instructions are the only way to do an atomic read-modify-write, so even relaxed would be the same as seq_cst (except for compile-time reordering).




If you want to yield the CPU instead of spinning, you need a system call (as people have said). Obviously a lot of work has gone into making standard-library locking as efficient as possible on Linux, including system calls designed specifically to help the kernel wake up the right thread(s) when a lock is released, and they are not simple to use. From futex(7):

NOTES
       To reiterate, bare futexes are not intended as an easy-to-use abstraction for end-users. (There is no wrapper function for this system call in glibc.) Implementors are expected to be assembly literate and to have read the sources of the futex user-space library referenced below.




Fairness / starvation (which my naive implementation ignores)

As the wikipedia article mentions, some kind of wake-up queue is a good idea, so that the same thread doesn't keep getting the semaphore every time. (Code that re-takes a lock quickly after releasing it would usually have the releasing thread get the lock again while other threads are still asleep.)

This is another big benefit of cooperating with the kernel for this (futex).

Mar 19 '16 at 2:00


