I am usually not one to dissuade anyone from striving to achieve fast code: it is usually a very good exercise that results in a better understanding of programming and faster code.
I will not dissuade you here either, but I can unequivocally state that the quest for a fast spin-lock, whether 3 instructions long or a few more, is - at least on the x86 architecture - a futile pursuit.
Here's why:
You invoke a spin-lock with a typical code sequence:
    lock_variable DW 0    ; 0 <=> free

    mov ebx,offset lock_variable
    mov eax,1
    xchg eax,[ebx]

    ; if eax contains 0 (no one owned it) you own the lock,
    ; if eax contains 1 (someone already does) you don't
Releasing the spin-lock is trivial:
    mov ebx,offset lock_variable
    mov dword ptr [ebx],0
The xchg instruction asserts the LOCK pin on the processor, which in effect says "I want the bus for the next few clock cycles." This signal weaves its way through the caches and all the way down to the slowest bus-mastering device, which is usually the PCI bus. When every bus-mastering device has completed, the LOCKA (lock acknowledge) signal is sent back. Only then does the actual exchange take place. The problem is that the LOCK/LOCKA sequence takes a VERY long time: the PCI bus may run at 33 MHz with several cycles of latency, and on a 3.3 GHz CPU that means each PCI bus cycle takes about a hundred CPU cycles.
As a rule of thumb, I assume that a lock takes between 300 and 3,000 CPU cycles to complete, and in the end I don't even know whether I will own the lock. So the few cycles you might save with a "fast" spin-lock are a mirage, because no lock attempt is like the next: it depends on the bus situation during that short window.
________________ EDIT
I just read that the spin-lock is a "heavily used object." Well, you obviously don't understand that a spin-lock eats an enormous number of CPU cycles every time it is invoked. Or, to put it another way, every time you invoke it you lose a significant amount of your processing capacity.
The trick when using spin-locks (or their bigger sibling, the critical section) is to use them as sparingly as possible while still achieving the intended program function. Using them all over the place is easy, and the result is lackluster performance.
It's not all about writing fast code, it's also about organizing your data. When you write "copying small structures between threads," you should realize that the lock may take hundreds of times longer to complete than the actual copying.
________________ EDIT
If you compute an average lock time, it will probably tell you very little, since it is measured on your machine, which may not be the intended target (and which may have entirely different bus-usage characteristics). On your machine, the average will be composed of individual very fast times (when bus-mastering activity didn't interfere) all the way up to very slow times (when bus-mastering interference was significant).
You could write code that records the fastest and slowest cases and computes their quotient to see just how much spin-lock times can vary.
________________ EDIT
May 2016 update.
Peter Cordes put forward the idea that "tuning a lock for the uncontended case might make sense" and that lock times of many hundreds of clock cycles do not occur on modern CPUs, except in situations where the lock variable is misaligned. I began to wonder whether my previous test program, written in 32-bit Watcom C, might have been hampered by WOW64, since it was running on a 64-bit OS: Windows 7.
So I wrote a 64-bit program and compiled it with TDM's gcc 5.3. The program uses the implicitly bus-locking "XCHG r,m" instruction variant for locking, and a simple "MOV m,r" assignment for unlocking. Some lock variants pre-test the lock variable to determine whether it is feasible to even attempt a lock (using a simple "CMP r,m", which probably doesn't venture outside L3). Here it is:
    // compiler flags used:
    // -O1 -m64 -mthreads -mtune=k8 -march=k8 -fwhole-program -freorder-blocks -fschedule-insns -falign-functions=32 -g3 -Wall -c -fmessage-length=0

    #define CLASSIC_BUS_LOCK
    #define WHILE_PRETEST
    //#define SINGLE_THREAD

    typedef unsigned char      u1;
    typedef unsigned short     u2;
    typedef unsigned long      u4;
    typedef unsigned int       ud;
    typedef unsigned long long u8;
    typedef signed char        i1;
    typedef short              i2;
    typedef long               i4;
    typedef int                id;
    typedef long long          i8;
    typedef float              f4;
    typedef double             f8;

    #define usizeof(a) ((ud)sizeof(a))

    #define LOOPS 25000000

    #include <stdio.h>
    #include <windows.h>

    #ifndef bool
    typedef signed char bool;
    #endif

    u8 CPU_rdtsc (void)
    {
      ud tickl, tickh;
      __asm__ __volatile__("rdtsc":"=a"(tickl),"=d"(tickh));
      return ((u8)tickh << 32)|tickl;
    }

    volatile u8 bus_lock (volatile u8 * block, u8 value)
    {
      __asm__ __volatile__( "xchgq %1,%0" : "=r" (value) : "m" (*block), "0" (value) : "memory");
      return value;
    }

    void bus_unlock (volatile u8 * block, u8 value)
    {
      __asm__ __volatile__( "movq %0,%1" : "=r" (value) : "m" (*block), "0" (value) : "memory");
    }

    void rfence (void)
    {
      __asm__ __volatile__( "lfence" : : : "memory");
    }

    void rwfence (void)
    {
      __asm__ __volatile__( "mfence" : : : "memory");
    }

    void wfence (void)
    {
      __asm__ __volatile__( "sfence" : : : "memory");
    }

    volatile bool LOCK_spinlockPreTestIfFree (const volatile u8 *lockVariablePointer)
    {
      return (bool)(*lockVariablePointer == 0ull);
    }

    volatile bool LOCK_spinlockFailed (volatile u8 *lockVariablePointer)
    {
      return (bool)(bus_lock (lockVariablePointer, 1ull) != 0ull);
    }

    void LOCK_spinlockLeave (volatile u8 *lockVariablePointer)
    {
      *lockVariablePointer = 0ull;
    }

    static volatile u8 lockVariable = 0ull,
                       lockCounter  = 0ull;

    static volatile i8 threadHold = 1;

    static u8 tstr[4][32];  /* 32*8=256 bytes for each thread's parameters
                               should result in them residing in different cache lines */

    struct LOCKING_THREAD_STRUCTURE
    {
      u8 numberOfFailures, numberOfPreTests;
      f8 clocksPerLock, failuresPerLock, preTestsPerLock;
      u8 threadId;
      HANDLE threadHandle;
      ud idx;
    } *lts[4] = {(void *)tstr[0], (void *)tstr[1], (void *)tstr[2], (void *)tstr[3]};

    DWORD WINAPI locking_thread (struct LOCKING_THREAD_STRUCTURE *ltsp)
    {
      ud n = LOOPS;
      u8 clockCycles;

      SetThreadAffinityMask (ltsp->threadHandle, 1ull<<ltsp->idx);

      while (threadHold) {}

      clockCycles = CPU_rdtsc ();
      while (n)
      {
        Sleep (0);

    #ifdef CLASSIC_BUS_LOCK
        while (LOCK_spinlockFailed (&lockVariable)) {++ltsp->numberOfFailures;}
    #else
    #ifdef WHILE_PRETEST
        while (1)
        {
          do
          {
            ++ltsp->numberOfPreTests;
          } while (!LOCK_spinlockPreTestIfFree (&lockVariable));

          if (!LOCK_spinlockFailed (&lockVariable)) break;
          ++ltsp->numberOfFailures;
        }
    #else
        while (1)
        {
          ++ltsp->numberOfPreTests;
          if (LOCK_spinlockPreTestIfFree (&lockVariable))
          {
            if (!LOCK_spinlockFailed (&lockVariable)) break;
            ++ltsp->numberOfFailures;
          }
        }
    #endif
    #endif
        ++lockCounter;
        LOCK_spinlockLeave (&lockVariable);

    #ifdef CLASSIC_BUS_LOCK
        while (LOCK_spinlockFailed (&lockVariable)) {++ltsp->numberOfFailures;}
    #else
    #ifdef WHILE_PRETEST
        while (1)
        {
          do
          {
            ++ltsp->numberOfPreTests;
          } while (!LOCK_spinlockPreTestIfFree (&lockVariable));

          if (!LOCK_spinlockFailed (&lockVariable)) break;
          ++ltsp->numberOfFailures;
        }
    #else
        while (1)
        {
          ++ltsp->numberOfPreTests;
          if (LOCK_spinlockPreTestIfFree (&lockVariable))
          {
            if (!LOCK_spinlockFailed (&lockVariable)) break;
            ++ltsp->numberOfFailures;
          }
        }
    #endif
    #endif
        --lockCounter;
        LOCK_spinlockLeave (&lockVariable);

        n-=2;
      }
      clockCycles = CPU_rdtsc ()-clockCycles;

      ltsp->clocksPerLock   = (f8)clockCycles/(f8)LOOPS;
      ltsp->failuresPerLock = (f8)ltsp->numberOfFailures/(f8)LOOPS;
      ltsp->preTestsPerLock = (f8)ltsp->numberOfPreTests/(f8)LOOPS;

    //rwfence ();

      ltsp->idx = 4u;

      ExitThread (0);
      return 0;
    }

    int main (int argc, char *argv[])
    {
      u8 processAffinityMask, systemAffinityMask;

      memset (tstr, 0u, usizeof(tstr));

      lts[0]->idx = 3;
      lts[1]->idx = 2;
      lts[2]->idx = 1;
      lts[3]->idx = 0;

      GetProcessAffinityMask (GetCurrentProcess(), &processAffinityMask, &systemAffinityMask);

      SetPriorityClass (GetCurrentProcess(), HIGH_PRIORITY_CLASS);
      SetThreadAffinityMask (GetCurrentThread (), 1ull);

      lts[0]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[0], 0, (void *)&lts[0]->threadId);
The program was run on a DELL i5-4310U based computer with DDR3-800, 2 cores/2 HTs capable of 2.7 GHz, and with a common L3 cache.
To begin with: the impact of WOW64 turned out to be negligible.
A single thread performing uncontended lock/unlock pairs was able to do so once every 110 cycles. Tuning the uncontended lock is useless: any code added to enhance the single XCHG instruction will only make it slower.
With four HTs bombarding the lock variable with lock attempts, the situation changes radically. The time required to achieve a successful lock jumps to 994 cycles, of which a significant part can be attributed to the 2.2 failed lock attempts per success. To put it another way, in a high-contention situation an average of 3.2 locks must be attempted to achieve one successful lock. Obviously, the 110 cycles did not become 110*3.2 but closer to 110*9, so other mechanisms are at work here, just as in the tests on the older machine. Also, the average of 994 cycles spans a range between 716 and 1157.
The lock variants that implement pre-testing required about 95% of the cycles needed by the simplest variant (XCHG). On average they performed 17 CMPs to find it feasible to attempt 1.75 locks, of which 1 was successful. I recommend pre-testing not only because it is faster: it also puts less strain on the bus-locking mechanism (3.2 - 1.75 = 1.45 fewer lock attempts), even though it slightly increases complexity.