Fastest inline-assembly spinlock

I am writing a multithreaded application in C++ where performance is critical. I need to do a lot of locking while copying small structures between threads, and for this I have chosen to use spinlocks.

I did some research and speed testing on this, and I found that most implementations are about equally fast:

  • Microsoft CRITICAL_SECTION, with SpinCount set to 1000, scores about 140 time units.
  • Implementing this algorithm with Microsoft's InterlockedCompareExchange scores about 95 time units.
  • I've also tried some inline assembly with __asm {}, using something like this code, and it scores about 70 time units, but I am not sure whether proper memory barriers have been created.

Edit: the times given here are the time it takes for 2 threads to lock and unlock the spinlock 1,000,000 times.

I know this is not a big difference, but since a spinlock is such a heavily-used object, one would think programmers would have agreed on the fastest possible way to make one. Googling it, however, leads to many different approaches. I would think the above method would be the fastest if implemented in inline assembly, using CMPXCHG8B instead of comparing 32-bit registers. Furthermore, memory barriers must be taken into account; this could be done by LOCK CMPXCHG8B (I think?), which guarantees "exclusive rights" to the shared memory between cores. Last but not least, some suggest that busy waits should be accompanied by NOP:REP, which would enable hyper-threading processors to switch to the other thread, but I am not sure whether this is true or not.
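To illustrate the kind of busy wait I mean, here is a minimal sketch of the approach described above, using compiler intrinsics rather than assembly (the 32-bit CAS intrinsic is used for brevity, and the variable name lock_flag is just for illustration):

    #include <intrin.h>   // _InterlockedCompareExchange, _mm_pause

    volatile long lock_flag = 0;   // 0 = free, 1 = taken

    void spin_lock()
    {
        // Compiles to LOCK CMPXCHG; the LOCK prefix provides the full barrier.
        while (_InterlockedCompareExchange(&lock_flag, 1, 0) != 0)
            _mm_pause();   // emits REP NOP / PAUSE inside the busy wait
    }

    void spin_unlock()
    {
        _ReadWriteBarrier();   // compiler fence; x86 will not move the store up
        lock_flag = 0;         // a plain store is enough to release on x86
    }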

From my performance testing of the various spinlocks, it is clear that there is not much difference, but for purely academic interest I would like to know which one is the fastest. However, since I have very limited experience with assembly language and with memory barriers, I would be happy if someone could write the assembly code for the last example I provided, with LOCK CMPXCHG8B and proper memory barriers, using the following template:

 __asm
 {
 spin_lock:
     ;locking code.
 spin_unlock:
     ;unlocking code.
 }
+23
c++ assembly x86 memory-barriers spinlock
Aug 14 '12 at 19:26
5 answers

Just look here: x86 spinlock using cmpxchg

And thanks to Cory Nelson

 __asm
 {
 spin_lock:
     xorl %ecx, %ecx
     incl %ecx
 spin_lock_retry:
     xorl %eax, %eax
     lock; cmpxchgl %ecx, (lock_addr)
     jnz spin_lock_retry
     ret

 spin_unlock:
     movl $0, (lock_addr)
     ret
 }
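For those who would rather not maintain raw assembly, roughly the same lock can be expressed with GCC's atomic builtins, which compile down to the same LOCK CMPXCHG (a sketch of mine, not part of the quoted answer; lock_var is an arbitrary name):

    static volatile int lock_var = 0;   /* 0 = free, 1 = taken */

    static void spin_lock(void)
    {
        /* emits LOCK CMPXCHG and acts as a full memory barrier */
        while (__sync_val_compare_and_swap(&lock_var, 0, 1) != 0)
            ;   /* spin */
    }

    static void spin_unlock(void)
    {
        __sync_synchronize();   /* full barrier before releasing */
        lock_var = 0;           /* plain store releases the lock on x86 */
    }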

And another source says: http://www.geoffchappell.com/studies/windows/km/cpu/cx8.htm

lock cmpxchg8b qword ptr [esi] is replaceable with the following sequence:

 try:
     lock bts dword ptr [edi],0
     jnb acquired
 wait:
     test dword ptr [edi],1
     je try
     pause                     ; if available
     jmp wait
 acquired:
     cmp eax,[esi]
     jne fail
     cmp edx,[esi+4]
     je exchange
 fail:
     mov eax,[esi]
     mov edx,[esi+4]
     jmp done
 exchange:
     mov [esi],ebx
     mov [esi+4],ecx
 done:
     mov byte ptr [edi],0

And here is a discussion about spinlocks and locking: http://newsgroups.derkeiler.com/Archive/Comp/comp.programming.threads/2011-10/msg00009.html

+3
Aug 14 '12

Although there is already an accepted answer, there are a few things that were missed that could be used to improve all the answers, taken from this Intel article, all about fast lock implementation:

  • Spin on a volatile read, not on an atomic instruction; this avoids unneeded bus locking, especially on highly contended locks.
  • Use back-off for highly contended locks.
  • Inline the lock, preferably with intrinsics for compilers where inline asm is detrimental (basically MSVC). (A sketch combining these points follows this list.)
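A minimal sketch of what those three points add up to, assuming MSVC intrinsics (the names and back-off constants are my own illustration, not taken from the Intel article):

    #include <intrin.h>   // _InterlockedExchange, _mm_pause

    volatile long lock_word = 0;   // hypothetical lock variable

    void fast_spin_lock()          // intrinsics keep this inlinable under MSVC
    {
        unsigned backoff = 1;
        for (;;)
        {
            // 1. Spin on a plain volatile read: no bus locking while waiting.
            while (lock_word != 0)
                _mm_pause();

            // The lock looks free; now try the atomic test-and-set.
            if (_InterlockedExchange(&lock_word, 1) == 0)
                return;   // acquired

            // 2. Still contended: back off exponentially before retrying.
            for (unsigned i = 0; i < backoff; ++i)
                _mm_pause();
            if (backoff < 1024)
                backoff <<= 1;
        }
    }

    void fast_spin_unlock()
    {
        lock_word = 0;   // plain store releases the lock on x86
    }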
+9
Aug 16

Wikipedia has a nice article on spinlocks; here is the x86 implementation:

http://en.wikipedia.org/wiki/Spinlock#Example_implementation

Note that their implementation does not use the "lock" prefix, because it is redundant on x86 for the "xchg" instruction: it has lock semantics implicitly, as discussed in this Stack Overflow question:

On multi-core x86, is LOCK needed as a prefix to XCHG?

REP:NOP is an alias for the PAUSE instruction; you can learn more about it here:

How does the x86 pause instruction work in a spinlock, and can it be used in other scenarios?
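Put together, the Wikipedia-style xchg lock with a PAUSE in the wait loop looks roughly like this (a GCC inline-asm sketch with my own naming, not copied from the article):

    /* 0 = free, 1 = taken */
    static volatile int locked = 0;

    static void spin_lock(void)
    {
        int one;
        for (;;) {
            one = 1;
            /* xchg with a memory operand is implicitly locked on x86,
               so no LOCK prefix is needed */
            __asm__ __volatile__("xchgl %0, %1"
                                 : "+r"(one), "+m"(locked)
                                 :
                                 : "memory");
            if (one == 0)
                return;                      /* old value was 0: lock acquired */
            __asm__ __volatile__("pause");   /* REP NOP while waiting */
        }
    }

    static void spin_unlock(void)
    {
        __asm__ __volatile__("" ::: "memory");   /* compiler barrier */
        locked = 0;                              /* plain store releases on x86 */
    }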

On the subject of memory barriers, here is everything you might want to know:

Memory Barriers: a Hardware View for Software Hackers, by Paul E. McKenney

http://irl.cs.ucla.edu/~yingdi/paperreading/whymb.2010.06.07c.pdf

+5
Aug 14 '12

I am usually not one to begrudge someone striving to achieve fast code: it is usually a very good exercise that results in a better understanding of programming and faster code.

I won't begrudge it here either, but I can state unequivocally that the question of a fast spinlock, 3 instructions long or a few more, is, at least on the x86 architecture, a futile chase.

Here's why:

You invoke a spinlock with a typical code sequence:

 lock_variable DW 0   ; 0 <=> free

     mov ebx,offset lock_variable
     mov eax,1
     xchg eax,[ebx]
     ; if eax contains 0 (no one owned it) you own the lock,
     ; if eax contains 1 (someone already does) you don't

Releasing the spinlock is trivial:

     mov ebx,offset lock_variable
     mov dword ptr [ebx],0

The xchg instruction raises the lock signal on the processor, which in effect says "I want the bus for the next few clock cycles." This signal wends its way through the caches and down to the slowest bus-mastering device, which is usually the PCI bus. When every bus-mastering device has completed, the locka (lock acknowledge) signal is sent back. Then the actual exchange takes place. The problem is that the lock/locka sequence takes a VERY long time. The PCI bus may run at 33 MHz with several cycles of latency. On a 3.3 GHz CPU, that means each PCI-bus cycle takes about a hundred CPU cycles.

As a rule of thumb, I assume that a lock will take between 300 and 3,000 CPU cycles to complete, and in the end I don't know whether I'll even get the lock. So the few cycles you might save with a "fast" spinlock will be a mirage, because no lock is like the next one; it depends on the bus situation during that short time.

________________ EDIT ________________

I just read that the spinlock is a "heavily used object." Well, you obviously don't realize that a spinlock eats up an enormous number of CPU cycles every time it is invoked. Or, to put it another way, every time you invoke it you lose a significant amount of your processing capacity.

The trick when using spinlocks (or their bigger sibling, the critical section) is to use them as sparingly as possible while still achieving the intended program function. Using them all over the place is easy, and the result is lackluster performance.

Writing fast code is not the whole story; organizing your data matters just as much. When you write "copying small structures between threads", you should realize that the lock may take hundreds of times longer to complete than the actual copying.
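To make that concrete, a sketch of my own (the names and the batch size are placeholders): if the copy itself is only a handful of cycles, batching several structures per lock acquisition amortizes the 300-3,000-cycle lock cost, and the locked region contains nothing but the copy:

    #include <intrin.h>   // _InterlockedExchange, _mm_pause

    struct Small { int a, b; };      // hypothetical small structure

    enum { BATCH = 16 };
    Small shared_buf[BATCH];         // shared between producer and consumer
    volatile long buf_lock = 0;      // spinlock protecting shared_buf

    // One lock acquisition per 16 copies instead of one per structure.
    void push_batch(const Small *src)
    {
        while (_InterlockedExchange(&buf_lock, 1) != 0)
            _mm_pause();
        for (int i = 0; i < BATCH; ++i)   // the only work done under the lock
            shared_buf[i] = src[i];
        buf_lock = 0;                     // release
    }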

________________ EDIT ________________

Calculating an average lock time will probably not tell you very much, since it is measured on your machine, which may not be the intended target (and which may have entirely different bus-usage characteristics). On your machine the average will be made up of individual times ranging from the very fast (when no bus-mastering activity interfered) to the very slow (when bus-mastering interference was significant).

You could write code that records the fastest and slowest cases and computes their ratio, to see just how much spinlock times can vary.
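A rough sketch of such a measurement, using RDTSC in the same way as the test program further down (the lock itself and the variable names are my own placeholders):

    #include <stdio.h>
    #include <intrin.h>   // __rdtsc, _InterlockedExchange, _mm_pause

    volatile long lock_flag = 0;   // placeholder spinlock variable

    void measure_lock_variation(int iterations)
    {
        unsigned long long fastest = ~0ull, slowest = 0;

        for (int i = 0; i < iterations; ++i)
        {
            unsigned long long t0 = __rdtsc();
            while (_InterlockedExchange(&lock_flag, 1) != 0)
                _mm_pause();
            unsigned long long dt = __rdtsc() - t0;   // cycles to acquire
            lock_flag = 0;                            // release

            if (dt < fastest) fastest = dt;
            if (dt > slowest) slowest = dt;
        }
        printf("fastest %llu, slowest %llu, ratio %.1f\n",
               fastest, slowest, (double)slowest / (double)fastest);
    }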

________________ EDIT ________________

May 2016 update.

Peter Cordes put forward the idea that "testing the lock in the un-contended case might make sense" and that lock times of many hundreds of clock cycles do not occur on modern CPUs, except in situations where the lock variable is misaligned. I started wondering whether my previous test program, written in 32-bit Watcom C, might be hampered by WOW64, since it was running on a 64-bit OS: Windows 7.

So I wrote a 64-bit program and compiled it with TDM's gcc 5.3. The program uses the implicitly bus-locking "XCHG r,m" instruction variant for locking and a simple "MOV m,r" assignment for unlocking. In some lock variants the lock variable was pre-tested to determine whether it was feasible to even attempt a lock (using a simple "CMP r,m", probably not venturing outside L3). Here it is:

 // compiler flags used:
 // -O1 -m64 -mthreads -mtune=k8 -march=k8 -fwhole-program -freorder-blocks
 // -fschedule-insns -falign-functions=32 -g3 -Wall -c -fmessage-length=0

 #define CLASSIC_BUS_LOCK
 #define WHILE_PRETEST
 //#define SINGLE_THREAD

 typedef unsigned char      u1;
 typedef unsigned short     u2;
 typedef unsigned long      u4;
 typedef unsigned int       ud;
 typedef unsigned long long u8;
 typedef signed char        i1;
 typedef short              i2;
 typedef long               i4;
 typedef int                id;
 typedef long long          i8;
 typedef float              f4;
 typedef double             f8;

 #define usizeof(a) ((ud)sizeof(a))

 #define LOOPS 25000000

 #include <stdio.h>
 #include <windows.h>

 #ifndef bool
 typedef signed char bool;
 #endif

 u8 CPU_rdtsc (void)
 {
   ud tickl, tickh;
   __asm__ __volatile__("rdtsc":"=a"(tickl),"=d"(tickh));
   return ((u8)tickh << 32)|tickl;
 }

 volatile u8 bus_lock (volatile u8 * block, u8 value)
 {
   __asm__ __volatile__( "xchgq %1,%0" : "=r" (value) : "m" (*block), "0" (value) : "memory");
   return value;
 }

 void bus_unlock (volatile u8 * block, u8 value)
 {
   __asm__ __volatile__( "movq %0,%1" : "=r" (value) : "m" (*block), "0" (value) : "memory");
 }

 void rfence (void)
 {
   __asm__ __volatile__( "lfence" : : : "memory");
 }

 void rwfence (void)
 {
   __asm__ __volatile__( "mfence" : : : "memory");
 }

 void wfence (void)
 {
   __asm__ __volatile__( "sfence" : : : "memory");
 }

 volatile bool LOCK_spinlockPreTestIfFree (const volatile u8 *lockVariablePointer)
 {
   return (bool)(*lockVariablePointer == 0ull);
 }

 volatile bool LOCK_spinlockFailed (volatile u8 *lockVariablePointer)
 {
   return (bool)(bus_lock (lockVariablePointer, 1ull) != 0ull);
 }

 void LOCK_spinlockLeave (volatile u8 *lockVariablePointer)
 {
   *lockVariablePointer = 0ull;
 }

 static volatile u8 lockVariable = 0ull,
                    lockCounter  = 0ull;

 static volatile i8 threadHold = 1;

 static u8 tstr[4][32];   /* 32*8=256 bytes for each thread's parameters
                             should result in them residing in different cache lines */

 struct LOCKING_THREAD_STRUCTURE
 {
   u8 numberOfFailures, numberOfPreTests;
   f8 clocksPerLock, failuresPerLock, preTestsPerLock;
   u8 threadId;
   HANDLE threadHandle;
   ud idx;
 } *lts[4] = {(void *)tstr[0], (void *)tstr[1], (void *)tstr[2], (void *)tstr[3]};

 DWORD WINAPI locking_thread (struct LOCKING_THREAD_STRUCTURE *ltsp)
 {
   ud n = LOOPS;
   u8 clockCycles;

   SetThreadAffinityMask (ltsp->threadHandle, 1ull<<ltsp->idx);

   while (threadHold) {}

   clockCycles = CPU_rdtsc ();
   while (n)
   {
     Sleep (0);

 #ifdef CLASSIC_BUS_LOCK
     while (LOCK_spinlockFailed (&lockVariable)) {++ltsp->numberOfFailures;}
 #else
 #ifdef WHILE_PRETEST
     while (1)
     {
       do { ++ltsp->numberOfPreTests; } while (!LOCK_spinlockPreTestIfFree (&lockVariable));

       if (!LOCK_spinlockFailed (&lockVariable)) break;
       ++ltsp->numberOfFailures;
     }
 #else
     while (1)
     {
       ++ltsp->numberOfPreTests;
       if (LOCK_spinlockPreTestIfFree (&lockVariable))
       {
         if (!LOCK_spinlockFailed (&lockVariable)) break;
         ++ltsp->numberOfFailures;
       }
     }
 #endif
 #endif
     ++lockCounter;
     LOCK_spinlockLeave (&lockVariable);

 #ifdef CLASSIC_BUS_LOCK
     while (LOCK_spinlockFailed (&lockVariable)) {++ltsp->numberOfFailures;}
 #else
 #ifdef WHILE_PRETEST
     while (1)
     {
       do { ++ltsp->numberOfPreTests; } while (!LOCK_spinlockPreTestIfFree (&lockVariable));

       if (!LOCK_spinlockFailed (&lockVariable)) break;
       ++ltsp->numberOfFailures;
     }
 #else
     while (1)
     {
       ++ltsp->numberOfPreTests;
       if (LOCK_spinlockPreTestIfFree (&lockVariable))
       {
         if (!LOCK_spinlockFailed (&lockVariable)) break;
         ++ltsp->numberOfFailures;
       }
     }
 #endif
 #endif
     --lockCounter;
     LOCK_spinlockLeave (&lockVariable);

     n-=2;
   }
   clockCycles = CPU_rdtsc ()-clockCycles;

   ltsp->clocksPerLock   = (f8)clockCycles/            (f8)LOOPS;
   ltsp->failuresPerLock = (f8)ltsp->numberOfFailures/ (f8)LOOPS;
   ltsp->preTestsPerLock = (f8)ltsp->numberOfPreTests/ (f8)LOOPS;

 //rwfence ();

   ltsp->idx = 4u;

   ExitThread (0);
   return 0;
 }

 int main (int argc, char *argv[])
 {
   u8 processAffinityMask, systemAffinityMask;

   memset (tstr, 0u, usizeof(tstr));

   lts[0]->idx = 3;
   lts[1]->idx = 2;
   lts[2]->idx = 1;
   lts[3]->idx = 0;

   GetProcessAffinityMask (GetCurrentProcess(), &processAffinityMask, &systemAffinityMask);

   SetPriorityClass (GetCurrentProcess(), HIGH_PRIORITY_CLASS);
   SetThreadAffinityMask (GetCurrentThread (), 1ull);

   lts[0]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[0], 0, (void *)&lts[0]->threadId);
 #ifndef SINGLE_THREAD
   lts[1]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[1], 0, (void *)&lts[1]->threadId);
   lts[2]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[2], 0, (void *)&lts[2]->threadId);
   lts[3]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[3], 0, (void *)&lts[3]->threadId);
 #endif

   SetThreadAffinityMask (GetCurrentThread (), processAffinityMask);

   threadHold = 0;

 #ifdef SINGLE_THREAD
   while (lts[0]->idx<4u) {Sleep (1);}
 #else
   while (lts[0]->idx+lts[1]->idx+lts[2]->idx+lts[3]->idx<16u) {Sleep (1);}
 #endif

   printf ("T0:%1.1f,%1.1f,%1.1f\n", lts[0]->clocksPerLock, lts[0]->failuresPerLock, lts[0]->preTestsPerLock);
   printf ("T1:%1.1f,%1.1f,%1.1f\n", lts[1]->clocksPerLock, lts[1]->failuresPerLock, lts[1]->preTestsPerLock);
   printf ("T2:%1.1f,%1.1f,%1.1f\n", lts[2]->clocksPerLock, lts[2]->failuresPerLock, lts[2]->preTestsPerLock);
   printf ("T3:%1.1f,%1.1f,%1.1f\n", lts[3]->clocksPerLock, lts[3]->failuresPerLock, lts[3]->preTestsPerLock);

   printf ("T*:%1.1f,%1.1f,%1.1f\n",
           (lts[0]->clocksPerLock+  lts[1]->clocksPerLock+  lts[2]->clocksPerLock+  lts[3]->clocksPerLock)/4.,
           (lts[0]->failuresPerLock+lts[1]->failuresPerLock+lts[2]->failuresPerLock+lts[3]->failuresPerLock)/4.,
           (lts[0]->preTestsPerLock+lts[1]->preTestsPerLock+lts[2]->preTestsPerLock+lts[3]->preTestsPerLock)/4.);

   printf ("LC:%u\n", (ud)lockCounter);

   return 0;
 }

The program was run on a DELL computer with an i5-4310U CPU (2 cores / 2 HTs capable of 2.7 GHz), DDR3-800 and a shared L3 cache.

To begin with, the impact of WOW64 was negligible.

A single thread was able to lock and unlock an uncontested lock once every 110 cycles. Tuning the uncontested lock is useless: any code added to enhance the single XCHG instruction will only make it slower.

With four HTs bombarding the lock variable with lock attempts, the situation changes radically. The time required to achieve a successful lock jumps to 994 cycles, of which a significant part is attributable to the 2.2 failed lock attempts. In other words, in a high-contention situation an average of 3.2 locks must be attempted to achieve one successful lock. Obviously the 110 cycles have not become 110*3.2 but something closer to 110*9, so other mechanisms are at play here, just as with the tests on the older machine. Also, the average of 994 cycles spans a range from 716 to 1157.

The lock variants with pre-testing required about 95% of the cycles required by the simplest variant (XCHG). On average they performed 17 CMPs to find it feasible to attempt 1.75 locks, of which 1 was successful. I recommend pre-testing, not only because it is faster, but because it imposes less load on the bus-locking mechanism (3.2 - 1.75 = 1.45 fewer lock attempts), even though it slightly increases complexity.

+4
Oct 19 '12 at 17:57

I'm just asking:

Before you dig that deep into spinlocks and nearly lock-free data structures:

Have you, in your benchmarks and your application, made sure that the competing threads are guaranteed to run on different cores?

If not, you may end up with a program that works fine on your development machine but thrashes or fails badly in the field, because one thread has to be both the locker and the unlocker of your spinlock.

To give you a figure: on Windows the standard time slice is 10 milliseconds. If you don't make sure that two physical threads are involved in the locking and unlocking, you will end up with about 500 locks/unlocks per second, and that result will be very meh.
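One way to guarantee this on Windows is to pin each competing thread to its own core with SetThreadAffinityMask, much as the test program in the earlier answer does. A minimal sketch (the worker function is just a placeholder):

    #include <windows.h>

    static DWORD WINAPI worker(LPVOID param)
    {
        /* ... the lock/unlock benchmark body would go here ... */
        return 0;
    }

    int main(void)
    {
        HANDLE t0 = CreateThread(NULL, 0, worker, NULL, CREATE_SUSPENDED, NULL);
        HANDLE t1 = CreateThread(NULL, 0, worker, NULL, CREATE_SUSPENDED, NULL);
        HANDLE both[2] = { t0, t1 };

        // Pin to different cores; note that on an HT machine masks 1 and 2
        // may address two logical threads of the same physical core.
        SetThreadAffinityMask(t0, 1 << 0);
        SetThreadAffinityMask(t1, 1 << 1);

        ResumeThread(t0);
        ResumeThread(t1);
        WaitForMultipleObjects(2, both, TRUE, INFINITE);
        return 0;
    }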

-1
Aug 14 '12 at 20:19


