Although this is not a complete solution for all applications, I found a way that supports the necessary basic functionality and passes at least some rudimentary multi-threaded tests:
    #define _Atomic(T) struct { volatile __typeof__(T) __val; }

    typedef _Atomic(int) atomic_int;

    #define atomic_load(object) \
        __sync_fetch_and_add(&(object)->__val, 0)

    #define atomic_store(object, desired) do { \
        __sync_synchronize(); \
        (object)->__val = (desired); \
        __sync_synchronize(); \
    } while (0)
The calls to __sync_synchronize and __sync_fetch_and_add are required; without them, communication between the threads fails (I did not test removing only one of them, only removing both).
I am not sure, however, that this solution works in all cases. I took it from https://gist.github.com/nhatminhle/5181506 , where the author does not recommend it for older versions of GCC.
In theory, you can also use a mutex. However, mutexes have lower performance than atomics.
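To make the thread-to-thread handoff concrete, here is a minimal sketch of the kind of test I mean. It repeats the fallback macros under a hypothetical compat_ prefix (so that it also compiles on machines that do have stdatomic.h), and the struct job, worker and run_job names are made up for illustration. It assumes GCC or Clang with pthreads (compile with -pthread).

```c
#include <assert.h>
#include <pthread.h>

/* Same fallback as above, renamed with a hypothetical compat_ prefix. */
typedef struct { volatile int val; } compat_atomic_int;

#define compat_load(object) \
    __sync_fetch_and_add(&(object)->val, 0)

#define compat_store(object, desired) do { \
    __sync_synchronize(); \
    (object)->val = (desired); \
    __sync_synchronize(); \
} while (0)

/* Hypothetical job record: a plain payload plus an atomic completion flag. */
struct job {
    int result;
    compat_atomic_int done;
};

static void *worker(void *arg)
{
    struct job *j = arg;
    j->result = 42;             /* plain write, published by the store below */
    compat_store(&j->done, 1);  /* barrier, write flag, barrier */
    return NULL;
}

/* Starts one worker, spins until it signals completion, returns its result. */
static int run_job(void)
{
    struct job j = { 0, { 0 } };
    pthread_t t;
    int rc = pthread_create(&t, NULL, worker, &j);
    assert(rc == 0);
    (void)rc;

    while (compat_load(&j.done) == 0) {
        /* busy-wait; the load is a full barrier, so the flag is re-read */
    }
    int result = j.result;      /* safe to read once the flag is set */
    pthread_join(t, NULL);
    return result;
}
```

Calling run_job() should return the value written by the worker. The full barriers on both sides are what make the plain write to result visible to the spinning thread.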
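For comparison, a mutex-based fallback could look roughly like the sketch below. The locked_int type, the LOCKED_INT_INIT macro and the function names are hypothetical, not part of any library. Every load and store takes the lock, which is where the extra cost comes from.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical mutex-protected integer. */
typedef struct {
    pthread_mutex_t lock;
    int val;
} locked_int;

#define LOCKED_INT_INIT(v) { PTHREAD_MUTEX_INITIALIZER, (v) }

static int locked_load(locked_int *object)
{
    pthread_mutex_lock(&object->lock);
    int v = object->val;
    pthread_mutex_unlock(&object->lock);
    return v;
}

static void locked_store(locked_int *object, int desired)
{
    pthread_mutex_lock(&object->lock);
    object->val = desired;
    pthread_mutex_unlock(&object->lock);
}

/* Single-threaded smoke test of the two accessors. */
static void locked_smoke_test(void)
{
    locked_int flag = LOCKED_INT_INIT(0);
    locked_store(&flag, 1);
    assert(locked_load(&flag) == 1);
}
```

Unlike the macro version, this does not depend on any GCC builtins, so it is the more portable option where performance is not critical.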
Edit:
It is also possible to implement atomic_store as follows:
    #define atomic_store(object, desired) do { \
        for (;;) \
        { \
            __typeof__((object)->__val) oldval = atomic_load(object); \
            if (__sync_bool_compare_and_swap(&(object)->__val, oldval, desired)) \
            { \
                break; \
            } \
        } \
    } while (0)
However, this reduced throughput from 139280.5 op/sec (standard deviation 1799.6 op/sec) to 131805.6 op/sec (standard deviation 986.03 op/sec), a drop of about 5.4%. The gap of roughly 7475 op/sec is more than four times the larger standard deviation, so I consider the performance degradation statistically significant.
Edit 2:
The loop approach produces the following assembly output:
        .globl  signal_completion
        .type   signal_completion, @function
    signal_completion:
    .LFB18:
        leaq    4(%rdi), %rcx
    .L42:
        xorl    %eax, %eax
        lock xaddl  %eax, (%rcx)
        movl    $1, %edx
        movl    %eax, -4(%rsp)
        movl    -4(%rsp), %eax
        lock cmpxchgl   %edx, (%rcx)
        jne     .L42
        rep ; ret
    .LFE18:
        .size   signal_completion, .-signal_completion
        .p2align 4,,15
The __sync_synchronize approach, by contrast, produces the following code:
        .globl  signal_completion
        .type   signal_completion, @function
    signal_completion:
    .LFB18:
        movl    $1, 4(%rdi)
        ret
    .LFE18:
        .size   signal_completion, .-signal_completion
        .p2align 4,,15
... and on a machine that has stdatomic.h, the same function compiles to:
        .globl  signal_completion
        .type   signal_completion, @function
    signal_completion:
    .LFB43:
        .cfi_startproc
        movl    $1, 4(%rdi)
        mfence
        ret
        .cfi_endproc
    .LFE43:
        .size   signal_completion, .-signal_completion
So the only thing I am missing is the mfence. I assume that it can be added with a simple inline assembly statement, for example:
asm volatile ("mfence" ::: "memory");
... placed after the second __sync_synchronize() in the definition of atomic_store.
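Put together, the store with the extra fence might look like the sketch below. This is untested beyond a single-threaded sanity check. The compat_ prefix, compat_hw_fence, struct job and the body of signal_completion are hypothetical names chosen for illustration, and the mfence instruction is only used when building for x86-64 (elsewhere the sketch falls back to __sync_synchronize).

```c
#include <assert.h>

typedef struct { volatile int val; } compat_atomic_int;

/* Assumption: mfence is only emitted on x86-64; other targets reuse the builtin. */
#if defined(__x86_64__)
#define compat_hw_fence() __asm__ __volatile__("mfence" ::: "memory")
#else
#define compat_hw_fence() __sync_synchronize()
#endif

#define compat_store(object, desired) do { \
    __sync_synchronize(); \
    (object)->val = (desired); \
    __sync_synchronize(); \
    compat_hw_fence(); /* explicit fence after the store */ \
} while (0)

/* Hypothetical record with the flag at offset 4, as in the listings above. */
struct job {
    int result;
    compat_atomic_int done;
};

static void signal_completion(struct job *j)
{
    compat_store(&j->done, 1);
}

/* Single-threaded sanity check: the flag must be set after the call. */
static void fence_smoke_test(void)
{
    struct job j = { 0, { 0 } };
    signal_completion(&j);
    assert(j.done.val == 1);
}
```

The "memory" clobber also acts as a compiler barrier, so the compiler should not move other memory accesses across the fence.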
Edit 3:
Apparently, __sync_fetch_and_add is not optimized away, since the loop that polls the variable has this assembly output:
    .L29:
        xorl    %eax, %eax
        lock xaddl  %eax, (%rdi)
        testl   %eax, %eax
        je      .L29
If you define atomic_load like this instead:
#define atomic_load(object) ((object)->__val)
You'll get:
    .L29:
        movl    (%rdi), %eax
        testl   %eax, %eax
        je      .L29
which matches the output of a build on a machine that supports stdatomic.h:
    .L38:
        movl    (%rdi), %eax
        testl   %eax, %eax
        je      .L38
Oddly enough, the __sync_fetch_and_add variant runs faster on my computer and in my test, even though it generates more complex code. Strange world, right?
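One caveat with the plain-read atomic_load: the volatile qualifier only forces the read itself to happen; it does not stop the compiler from moving surrounding non-volatile accesses across it. A possible middle ground, sketched below under the assumption that the target is x86 (where the hardware does not reorder a load with later loads or stores), is to follow the read with a compiler-only barrier. The compat_ names are hypothetical.

```c
#include <assert.h>

typedef struct { volatile int val; } compat_atomic_int;

/* Plain load plus compiler barrier. Assumption: x86 only; weaker
 * architectures would need a real hardware fence here. */
static inline int compat_load_plain(compat_atomic_int *object)
{
    int v = object->val;                    /* volatile read: a single movl */
    __asm__ __volatile__("" ::: "memory");  /* emits no instruction */
    return v;
}

/* Single-threaded sanity check of the accessor. */
static void load_smoke_test(void)
{
    compat_atomic_int flag = { 3 };
    assert(compat_load_plain(&flag) == 3);
}
```

The empty asm statement generates no machine code, so the polling loop should still compile to the short movl/testl/je form.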