How does the cache coherence protocol provide atomicity?

I understand that atomicity can be guaranteed with operations such as xsub(), without using the LOCK prefix, by relying on the cache coherence protocol (MESI/MESIF).

1) How can the cache coherence protocol do this?

This makes me wonder: if the cache coherence protocol can provide atomicity, why do we need special atomic types/instructions, etc.?

2) If MESI implements atomic instructions for multi-core systems, then what is the purpose of LOCK? Is it a legacy feature?

3) If MESI implements atomic instructions, and MESI is used for all instructions, then why are atomic instructions so expensive? Surely they should perform the same as ordinary instructions.

2 answers

In x86 there is no xsub instruction, but there is xadd ;)

You should read the section about the LOCK prefix in the Instruction Set Reference, and section 8.1 LOCKED ATOMIC OPERATIONS in the Intel Software Developer's Manual, Volume 3A: System Programming Guide, Part 1.

A single processor nowadays is one core with its own cache. When you have multiple caches for multiple cores (physically in the same or in separate processor packages), they use some cache coherency protocol. In the case of MESI, the core executing an atomic instruction first ensures it has ownership of the cache line containing the operand and marks it Modified, additionally locking it. If another core needs that cache line, it performs a read operation which the owning core snoops, delaying its answer until the atomic operation completes.

On single-core, single-processor systems, most instructions are atomic with respect to threading, except for string instructions using a REP prefix, because scheduling interrupts, and therefore context switches, happen only on instruction boundaries. A hardware device, however, could observe non-atomic behavior.


Atomicity and memory ordering

For an operation to be atomic, it must appear to be one indivisible operation to any observer. That observer can be anything that can see the effects of the operation, whether it's the thread that performed the operation, another thread on the same processor, a thread on a different processor, or some component or device in the system. Observers that can't see the effects of the operation, whether on the same thread, another thread, or a device, don't affect whether the operation is atomic or not.

(Note that by processor I mean what Intel's documentation would call a logical processor. A system with two processor sockets, each containing a quad-core CPU with two logical processors per core, would have a total of 16 processors.)

A related but different concept is memory ordering. Memory accesses are sequentially consistent only if they appear, to an observer, to happen in the order they occur in the program. This guarantee always applies when the observer is the same thread that performed the operations. Other, more limited memory ordering guarantees are also possible. Strong but not sequentially consistent ordering may guarantee that many kinds of operations are ordered with respect to each other, but not all. Weak memory ordering makes no guarantees about how accesses appear to other threads.

Compilers and atomicity

When you write a program in C or another higher-level language, it may appear that certain operations are atomic and sequentially ordered, but the compiler generally only guarantees this when viewed from the same thread that performed those operations. From the compiler's point of view, however, any code that runs when a thread is asynchronously interrupted happens in a different thread of execution, even if that code runs in the same OS thread. This means that code running in a signal handler or a structured exception handler has no guarantee that operations performed elsewhere in the same thread appear atomic or sequentially consistent.

Because of this limited general guarantee, the compiler is free to do things like implement what look like atomic operations using multiple assembly instructions, making them non-atomic to other observers. It can also reorder memory accesses, or even eliminate apparently redundant ones entirely. It can perform whatever optimizations it wants, as long as, in the single-threaded case, the program still behaves as if it performed all those operations in program order.

In the multi-threaded case, or where signal or exception handlers are present, special steps must be taken to tell the compiler where you need it to provide broader guarantees of atomicity and memory ordering. That is the purpose of the special atomic types and functions. Even if the CPU guarantees that every instruction is atomic and that every memory access is sequentially consistent with respect to all other threads, the compiler doesn't.

Intel processors and atomicity

Intel processors make it easy for a compiler to provide these guarantees. Apart from some odd cases, instructions are uninterruptible. Any event that interrupts the execution of an instruction either occurs after the instruction has fully completed, or allows the instruction to be restarted as if it had never executed. This means that at the machine-code level, every operation is atomic and every memory operation is sequentially consistent, as long as the code runs on a single processor. In the single-processor case, nothing needs to be done to provide these guarantees, unless they must also be visible to devices other than the processor. In that case, the LOCK prefix, combined with uncached memory regions, must be used to guarantee that read/modify/write instructions are atomic and that memory accesses appear sequentially consistent to other devices.

In the multi-processor case, when accessing cached memory, the cache coherency protocol provides guarantees of atomicity for most instructions and of strong memory ordering, but not of sequential consistency. The exact mechanism by which it does this doesn't matter much, only the guarantees it gives. Any instruction that accesses just a single memory location will appear atomic to other processors. The ordering guarantees are too long to go into here (Intel's documentation takes 16 bullet points to describe them), but they appear to be a superset of the guarantees C and C++ require for acquire and release memory ordering. At that level of memory ordering, C/C++ atomic operations can use ordinary, unlocked instructions.

The need for the LOCK prefix, and for the instructions where the LOCK prefix is implicit, arises when you need stronger guarantees than the cache coherency protocol provides. If you need your read/modify/write instructions to be atomic, you need to use the LOCK prefix. If you need sequential consistency, you need to use the LOCK prefix.

The LOCK prefix is where the high cost of atomic operations comes from. It causes the processor to wait for all previous load and store operations to complete. Even though, when accessing cached memory, the LOCK prefix is handled entirely within the cache without asserting LOCK#, the processor still has to wait in order to ensure that the operation appears sequentially consistent to other processors.

Summary

So, the answers to your questions are:

  • The cache coherence protocol can only guarantee the atomicity of a given machine code instruction as seen from other processors. It can't guarantee that the compiler generates a single instruction for the operation you want to be atomic. It also can't guarantee that the instruction appears atomic to non-processor devices in the system.
  • The LOCK prefix is used for machine code instructions that must
    • perform multiple memory accesses and still appear atomic to other processors,
    • be sequentially consistent with respect to other processors, or
    • be atomic and/or sequentially consistent with respect to other, non-processor devices.
  • When the necessary atomicity and memory ordering guarantees can be obtained without the LOCK prefix, the instructions used are the same as ordinary instructions, and so they cost the same. Where the LOCK prefix is needed to provide the necessary guarantees, the instruction becomes much more expensive than an ordinary one.

Source: https://habr.com/ru/post/1200499/

