Java 8 Unsafe: xxxFence() Instructions

In Java 8, three memory fence methods were added to the Unsafe class:

 /**
  * Ensures lack of reordering of loads before the fence
  * with loads or stores after the fence.
  */
 void loadFence();

 /**
  * Ensures lack of reordering of stores before the fence
  * with loads or stores after the fence.
  */
 void storeFence();

 /**
  * Ensures lack of reordering of loads or stores before the fence
  * with loads or stores after the fence.
  */
 void fullFence();
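These methods live on sun.misc.Unsafe, which has no public constructor; the usual (unsupported) trick is to pull the singleton out via reflection. A minimal sketch — grabbing theUnsafe this way is a HotSpot-specific hack and may be blocked on restricted runtimes:

```java
import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class FenceSmokeTest {
    static boolean ran;

    public static void main(String[] args) throws Exception {
        // HotSpot-specific: Unsafe keeps its singleton in a private static field.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        unsafe.loadFence();   // LoadLoad | LoadStore
        unsafe.storeFence();  // StoreStore | LoadStore
        unsafe.fullFence();   // all of the above plus StoreLoad
        ran = true;
    }
}
```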

If we define a memory barrier as follows (which I consider more or less clear):

Let X and Y be types of operations that are subject to reordering.

X_YFence() is a memory barrier instruction that ensures that all operations of type X before the barrier complete before any operation of type Y after the barrier is started.

Now we can "map" the barrier names from Unsafe to this terminology:

  • loadFence() becomes load_loadStoreFence();
  • storeFence() becomes store_loadStoreFence();
  • fullFence() becomes loadStore_loadStoreFence();

Finally, my question is: why don't we have load_storeFence(), store_loadFence(), store_storeFence() and load_loadFence()?

My guess is that they are simply not needed, but I don't understand why at the moment. So I would like to know the reasons they were not added. Educated guesses are welcome too (I hope this doesn't make the question off-topic as opinion-based).

Thanks in advance.

+44
java concurrency java-8 unsafe memory-fences
May 12 '14 at 7:28
3 answers

Summary

Processor cores have memory ordering buffers that help them with out-of-order execution. These can be (and typically are) separate for loads and stores: LOBs for load order buffers and SOBs for store order buffers.

The fencing operations chosen for the Unsafe API were selected based on the following assumption: processor cores have separate load order buffers (for reordering loads) and store order buffers (for reordering stores).

Therefore, based on this assumption, from a software point of view, you can request one of three things from the CPU:

  • Empty the LOBs (loadFence): no other instructions will start executing on this core until ALL entries in the LOBs are processed. On x86, this is LFENCE.
  • Empty the SOBs (storeFence): no other instructions will start executing on this core until ALL entries in the SOBs are processed. On x86, this is SFENCE.
  • Empty both the LOBs and SOBs (fullFence): both of the above. On x86, this is MFENCE.

In reality, each specific processor architecture provides different memory ordering guarantees, which may be more strict or more relaxed than the above. For example, the SPARC architecture can reorder stores with loads, while x86 will not. Furthermore, there are architectures where the LOBs and SOBs cannot be controlled individually (i.e. only a full fence is possible). In both cases, however:

  • when the architecture is more relaxed, the API simply does not provide access to the laxer fence combinations, as a matter of design choice;

  • when the architecture is more strict, the API simply implements a stricter ordering guarantee in all cases (for example, all three calls might actually be implemented as a full fence).

The reason for the particular API choices is explained in the JEP, as per the answer assylias provided, which is 100% on point. If you already know about memory ordering and cache coherence, assylias' answer should suffice. I believe the fact that they match the standardized instructions in the C++ API was a major factor (it greatly simplifies the JVM implementation): http://en.cppreference.com/w/cpp/atomic/memory_order In all likelihood, the actual implementation calls into the corresponding C++ API instead of using any special instruction.

Below I provide a detailed explanation with x86-based examples, which gives all the context needed to understand these things. In fact, the section below really answers a different question: "Can you provide basic examples of how memory fences work to control cache coherence in the x86 architecture?"

The reason for this is that I myself (a software developer, not a hardware designer) did not understand what memory reordering was until I learned through specific examples how cache coherence actually works in x86. This provides invaluable context for discussing memory fences in general (for other architectures as well). At the end I briefly discuss SPARC, using the knowledge gained from the x86 examples.

Reference [1] is an even more detailed explanation and has a separate section discussing each of x86, SPARC, ARM, and PowerPC, so it is a good read if you are interested in more detail.




Architecture example

x86

x86 provides 3 types of fencing instructions: LFENCE (load fence), SFENCE (store fence) and MFENCE (memory fence), so it maps 100% to the Java API.

This is because x86 has separate load order buffers (LOBs) and store order buffers (SOBs), so the LFENCE/SFENCE instructions apply to the respective buffer, while MFENCE applies to both.

SOBs are used to hold the outgoing value (from processor to cache system) while the cache coherence protocol works to acquire write permission for the cache line. LOBs are used to hold invalidation requests, so invalidation can be performed asynchronously (this reduces stalling on the receiving side, in the hope that the code executing there will not actually need the invalidated value).

Out-of-order stores and SFENCE

Suppose you have a dual-processor system with two CPUs, 0 and 1, executing the routines below. Consider the case where the cache line holding failure is initially owned by CPU 1, while the cache line holding shutdown is initially owned by CPU 0.

 // CPU 0:
 void shutDownWithFailure(void)
 {
   failure = 1;  // must use SOB as this is owned by CPU 1
   shutdown = 1; // can execute immediately as it is owned by CPU 0
 }

 // CPU 1:
 void workLoop(void)
 {
   while (shutdown == 0) { ... }
   if (failure) { ... }
 }

In the absence of a store fence, CPU 0 may signal shutdown due to failure, but CPU 1 may exit the loop and NOT enter the failure-handling if block.

This is because CPU 0 will write the value 1 for failure into a store order buffer, also sending out a cache coherence message to acquire exclusive access to the cache line. It will then proceed to the next instruction (while waiting for exclusive access) and update the shutdown flag immediately (that cache line is already owned exclusively by CPU 0, so no negotiation with other cores is needed). Finally, when it later receives the invalidation acknowledgement from CPU 1 (regarding failure), it proceeds to process the SOB for failure and writes the value to the cache (but the order is by now reversed).

Inserting a storeFence() will fix this:

 // CPU 0:
 void shutDownWithFailure(void)
 {
   failure = 1;  // must use SOB as this is owned by CPU 1
   SFENCE        // next instruction will execute after all SOBs are processed
   shutdown = 1; // can execute immediately as it is owned by CPU 0
 }

 // CPU 1:
 void workLoop(void)
 {
   while (shutdown == 0) { ... }
   if (failure) { ... }
 }

The last aspect worth mentioning is that x86 has store-to-load forwarding: when a CPU writes a value that gets stuck in an SOB (due to cache coherence), it may subsequently attempt to execute a load instruction for the same address BEFORE the SOB is processed and delivered to the cache. CPUs therefore consult the SOBs PRIOR to accessing the cache, so the value retrieved in this case is the last-written value from the SOB. This means that stores from THIS core can never be reordered with subsequent loads from THIS core, no matter what.
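Store-to-load forwarding is a hardware mechanism, but its visible consequence maps to something every Java programmer relies on: a thread always observes its own writes in program order. A trivial sketch (names are mine; the comments describe the hardware behaviour from the text):

```java
public class StoreForwardingSketch {
    static int slot;

    static int writeThenRead(int v) {
        slot = v;     // the store may sit in an SOB while coherence traffic completes
        return slot;  // the load checks the SOBs first, so it sees v (store forwarding)
    }
}
```

`writeThenRead(7)` returns 7 even if the store to `slot` has not yet reached the cache subsystem.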

Out-of-order loads and LFENCE

Now suppose you have the store fence in place and are happy that shutdown cannot overtake failure on its way to CPU 1, and focus on the other side. Even in the presence of the store fence, there are scenarios in which the wrong thing happens. Consider the case where failure is in both caches (shared), whereas shutdown is present only in, and owned exclusively by, the cache of CPU 0. Bad things can happen as follows:

  • CPU0 writes 1 to failure; it also sends CPU1 a message to invalidate its copy of the shared cache line, as part of the cache coherence protocol.
  • CPU0 executes the SFENCE and stalls, waiting for the SOB used for failure to commit.
  • CPU1 checks shutdown due to the while loop and (realizing it is missing the value) sends a cache coherence message to read the value.
  • CPU1 receives the message from CPU0 in step 1 to invalidate failure, sending an immediate acknowledgement for it. NOTE: this is implemented using an invalidation queue, so in fact it simply makes a note (allocates an entry in its LOB) to perform the invalidation later, but does not actually perform it before sending the acknowledgement.
  • CPU0 receives the acknowledgement for failure and proceeds past the SFENCE to its next instruction.
  • CPU0 writes 1 to shutdown without using an SOB, because it already owns the cache line exclusively. No extra invalidation message is sent, as the cache line is exclusive to CPU0.
  • CPU1 receives the value of shutdown, commits it to its local cache, and proceeds to the next line.
  • CPU1 checks the value of failure for the if statement, but since the invalidation queue (the LOB note) is not yet processed, it uses the value 0 from its local cache (and does not enter the if block).
  • CPU1 processes the invalidation queue and updates failure to 1, but it is too late...

What we call load order buffers is actually the queueing of invalidation requests, and the above can be fixed with:

 // CPU 0:
 void shutDownWithFailure(void)
 {
   failure = 1;  // must use SOB as this is owned by CPU 1
   SFENCE        // next instruction will execute after all SOBs are processed
   shutdown = 1; // can execute immediately as it is owned by CPU 0
 }

 // CPU 1:
 void workLoop(void)
 {
   while (shutdown == 0) { ... }
   LFENCE // next instruction will execute after all LOBs are processed
   if (failure) { ... }
 }
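On the Java side, the same pattern can be written today without Unsafe: since Java 9, the equivalent fences are public static methods on java.lang.invoke.VarHandle. A sketch of the shutdown/failure pair — here the two routines run sequentially so the outcome is deterministic; under real concurrency the fences are what rule out the reorderings described above:

```java
import java.lang.invoke.VarHandle;

public class ShutdownDemo {
    static int failure;
    static int shutdown;

    static void shutDownWithFailure() {
        failure = 1;
        VarHandle.releaseFence(); // ~ storeFence(): failure drains before shutdown is written
        shutdown = 1;
    }

    static int workLoop() {
        while (shutdown == 0) { /* spin */ }
        VarHandle.acquireFence(); // ~ loadFence(): drain the invalidation queue before reading failure
        return failure;
    }

    public static void main(String[] args) {
        shutDownWithFailure();
        System.out.println(workLoop()); // prints 1
    }
}
```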

Your question on x86

Now that you know what SOBs and LOBs do, consider the combinations you mentioned:

 loadFence() becomes load_loadStoreFence();

Correct. The load fence waits for the LOBs to be processed, essentially emptying the invalidation queue. This means that all subsequent loads will see up-to-date data (no reordering), as they will be fetched from the cache subsystem (which is coherent). Stores CANNOT be reordered with subsequent loads, because they do not pass through the LOBs (and, furthermore, store forwarding takes care of locally modified cache lines). From the perspective of THIS particular core (the one executing the load fence), a store that follows the load fence will execute AFTER all registers have their data loaded. There is no way around it.

 load_storeFence() becomes ??? 

There is no need for a load_storeFence, as it does not make sense. To store something, you must compute it from input. To fetch input, you must execute loads. The stores will occur using data fetched from loads. If you want to make sure you see up-to-date values from all other processors when loading, use a loadFence. For loads after the fence, store forwarding takes care of consistent ordering.
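The data-dependency argument fits in two lines: the store literally has nothing to write until the load has produced a value, so hardware cannot commit it first. A trivial sketch (names are mine):

```java
public class DependentStore {
    static int input = 41;
    static int output;

    static void copyPlusOne() {
        int v = input;   // the load must complete first...
        output = v + 1;  // ...because the store's value depends on it
    }
}
```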

All other cases are similar.




SPARC

SPARC is even more relaxed and can reorder stores with subsequent loads (and loads with subsequent stores). I was not as familiar with SPARC, so my GUESS was that there is no store forwarding (SOBs are not consulted when reloading an address), so "dirty reads" would be possible. In fact, I was wrong: I found the SPARC architecture in [3], and the reality is that store forwarding is per-thread. From section 5.3.4:

All loads check the store buffer (same thread only) for read-after-write (RAW) hazards. A full RAW occurs when the dword address of the load matches that of a store in the STB and all bytes of the load are valid in the store buffer. A partial RAW occurs when the dword addresses match, but not all bytes are valid in the store buffer. (For example, an ST (word store) followed by an LDX (dword load) to the same address results in a partial RAW, because the full dword is not in the store buffer entry.)

So, different threads consult different store buffers, hence the possibility of dirty reads after stores.




References

[1] Memory Barriers: a Hardware View for Software Hackers, Linux Technology Center, IBM Beaverton http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf

[2] Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

[3] OpenSPARC T2 Core Microarchitecture Specification http://www.oracle.com/technetwork/systems/opensparc/t2-06-opensparct2-core-microarch-1537749.html

+50
May 20 '14 at 22:11

A good source of information is JEP 171 itself.

Justification:

The three methods provide three different kinds of memory fences, which ensure that certain accesses (loads and stores) performed by compilers and processors are not reordered.

Implementation (extract):

The C++ runtime versions (in prims/unsafe.cpp) implement them via the existing OrderAccess methods:

 loadFence:  { OrderAccess::acquire(); }
 storeFence: { OrderAccess::release(); }
 fullFence:  { OrderAccess::fence();   }

In other words, the new methods are closely related to how memory fences are implemented at the JVM and CPU levels. They also correspond to the memory ordering operations available in C++, the language in which HotSpot is implemented.

A finer-grained approach would probably have been feasible, but the benefits are not obvious.

For example, if you look at the table of CPU instructions in the JSR 133 Cookbook, you will see that LoadStore and LoadLoad map to the same instructions on most architectures, i.e. both are effectively Load_LoadStore instructions. So having a single Load_LoadStore (loadFence) instruction at the JVM level seems like a reasonable design decision.
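The JDK later exposed the same three-fence design publicly: java.lang.invoke.VarHandle (Java 9+) offers acquireFence(), releaseFence() and fullFence(), plus the finer-grained loadLoadFence() and storeStoreFence() — and, echoing the cookbook point above, the documentation permits the latter two to be implemented as the stronger acquire/release fences. A sketch of the mapping:

```java
import java.lang.invoke.VarHandle;

public class FenceMapping {
    static boolean ok;

    public static void main(String[] args) {
        VarHandle.acquireFence();    // ~ Unsafe.loadFence():  LoadLoad | LoadStore
        VarHandle.releaseFence();    // ~ Unsafe.storeFence(): StoreStore | LoadStore
        VarHandle.fullFence();       // ~ Unsafe.fullFence():  adds the StoreLoad barrier
        VarHandle.loadLoadFence();   // finer-grained; may be implemented as acquireFence()
        VarHandle.storeStoreFence(); // finer-grained; may be implemented as releaseFence()
        ok = true;
    }
}
```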

+7
May 21 '14 at 11:48

The documentation for storeFence() is wrong. See https://bugs.openjdk.java.net/browse/JDK-8038978

loadFence() is LoadLoad plus LoadStore, which is useful and often called an acquire fence.

storeFence() is StoreStore plus LoadStore, which is useful and often called a release fence.

LoadLoad, LoadStore and StoreStore are cheap fences (a nop on x86 or SPARC, cheap on POWER, maybe more expensive on ARM).

IA64 has separate instructions for acquire and release semantics.

fullFence() is LoadLoad plus LoadStore plus StoreStore plus StoreLoad.

A StoreLoad fence is expensive (on nearly every CPU), almost as expensive as a full fence.
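The classic place where StoreLoad is unavoidable is the Dekker-style handshake: each thread stores its own flag and then loads the other's. Without a full fence, both loads may be served before either store drains, and both threads can read 0. A sketch using VarHandle.fullFence() (Java 9+) as the StoreLoad barrier — run sequentially here for determinism; the interesting case is concurrent execution:

```java
import java.lang.invoke.VarHandle;

public class DekkerSketch {
    static int x, y;
    static int r1, r2;

    static void threadA() {
        x = 1;
        VarHandle.fullFence(); // StoreLoad: x must drain before y is loaded
        r1 = y;
    }

    static void threadB() {
        y = 1;
        VarHandle.fullFence(); // StoreLoad: y must drain before x is loaded
        r2 = x;
    }

    public static void main(String[] args) {
        threadA();
        threadB();
        // With the fences in place, even concurrent runs can never end with r1 == 0 && r2 == 0.
        System.out.println(r1 + " " + r2);
    }
}
```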

This justifies the design of the API.

+4
Jun 02 '15 at 18:30


