Atomic memcpy suggestion


While testing a program for scalability, I came across a situation where I need to perform a memcpy as an atomic operation. I need to copy 64 bytes of data from one place to another.
I came across one solution that spins on a variable:

    struct record {
        volatile int startFlag;
        char data[64];
        volatile int doneFlag;
    };

and the pseudocode follows:

    struct record *node;

    if (node->startFlag == 0) {                       // test the flag
        if (CompareAndSwap(node->startFlag, 0, 1)) {  // all threads try to set it; only one succeeds and performs the memcpy
            memcpy(destination, source, NoOfBytes);
            node->doneFlag = 1;                       // spin variable for the other threads, those that failed in CompareAndSwap
        } else {
            while (node->doneFlag == 0) {             // other threads spin
                ;                                     // spin and/or use a back-off policy
            }
        }
    }

Can this work as an atomic memcpy? One concern: if the thread executing the memcpy is preempted (before or after the memcpy, but before setting doneFlag), the others will keep spinning. If this does not work, what can be done to make the copy atomic?
The situation is such that the other threads have to wait until the data is copied, because they must compare the inserted data with their own data.
I use the test-and-test-and-set approach on startFlag to avoid some costly atomic operations. Spin-locks are also scalable, but I measured that atomic calls give better performance than spin-locks; besides, I am looking for problems that may arise in this fragment. And since I use my own memory manager, memory allocation and free calls are expensive for me. Copying the contents into a separate buffer and then setting a pointer (since a pointer-sized write can be done atomically) is therefore expensive, because it requires a lot of mem-alloc and mem-free calls.

EDIT: I do not use a mutex because mutexes do not seem to be scalable, and this is just one part of the program, so the critical section is not that small (I understand that atomic operations are difficult to use for a larger critical section).

+6
4 answers

Your piece of code is definitely broken. There is a race on node->startFlag.
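To make the race concrete, here is a minimal sketch of the same one-shot scheme with both flags accessed through C11 atomics. It assumes a compiler that provides <stdatomic.h>; copy_once, start_flag and done_flag are hypothetical names, not taken from the question. It also makes a thread that already sees the start flag set wait for the done flag instead of skipping ahead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define RECORD_SIZE 64

struct record {
    atomic_int start_flag;   /* 0 = unclaimed, 1 = a thread has claimed the copy */
    char data[RECORD_SIZE];
    atomic_int done_flag;    /* 0 = copy pending, 1 = data is complete */
};

/* Returns true if the calling thread performed the copy, false if it
   waited for another thread's copy to finish. */
static bool copy_once(struct record *node, const char *src, size_t n)
{
    assert(n <= RECORD_SIZE);
    int expected = 0;
    /* Test-and-test-and-set: cheap load first, CAS only if it looks free. */
    if (atomic_load_explicit(&node->start_flag, memory_order_relaxed) == 0 &&
        atomic_compare_exchange_strong(&node->start_flag, &expected, 1)) {
        memcpy(node->data, src, n);
        /* Release: the copied bytes become visible before done_flag does. */
        atomic_store_explicit(&node->done_flag, 1, memory_order_release);
        return true;
    }
    /* Every other thread waits, including one that saw start_flag == 1. */
    while (atomic_load_explicit(&node->done_flag, memory_order_acquire) == 0)
        ;   /* spin; a back-off policy could go here */
    return false;
}
```

This does not remove the concern raised in the question: if the copying thread is preempted before it sets done_flag, the waiting threads keep spinning until it runs again.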

Unfortunately, there is no atomic way to copy 64 bytes. I think you have several options.

  • Access node->startFlag atomically. I wrote several posts on this subject: here and here.
  • Protect the whole thing with a user-mode spinlock. Here is a related post.
  • Use an approach similar to RCU. You can read about RCU here. In a nutshell, the idea is to reference the buffer you want to copy through a pointer. Then you do the following:
    • Allocate a new buffer.
    • Fill in its contents (memcpy from your source).
    • Atomically replace the pointer to the old buffer with a pointer to the new one.
    • Wait until all threads accessing the old buffer have finished with it, then free it.
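The RCU-like steps above can be sketched roughly as follows. This is a minimal illustration assuming C11 atomics; record_buf, current_buf, publish and snapshot are hypothetical names, and the last step (waiting for readers before freeing the old buffer) is left to the caller because it depends on how readers are tracked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define RECORD_SIZE 64

struct record_buf { char data[RECORD_SIZE]; };

/* Shared pointer to the currently published buffer (NULL until first publish). */
static _Atomic(struct record_buf *) current_buf;

/* Writer: build a private copy, then publish it with one atomic exchange.
   On success returns 0 and stores the previous buffer in *old_out; the
   caller may free it only after every reader that could still hold it
   has finished (that grace period is not shown here). */
static int publish(const char *src, size_t n, struct record_buf **old_out)
{
    assert(n <= RECORD_SIZE);
    struct record_buf *fresh = calloc(1, sizeof *fresh);  /* allocate a new buffer */
    if (fresh == NULL)
        return -1;
    memcpy(fresh->data, src, n);                           /* fill in its contents */
    *old_out = atomic_exchange_explicit(&current_buf, fresh,
                                        memory_order_acq_rel);  /* atomic swap */
    return 0;
}

/* Reader: sees either the old or the new buffer in full, never a
   half-copied one. */
static const struct record_buf *snapshot(void)
{
    return atomic_load_explicit(&current_buf, memory_order_acquire);
}
```

Since the question mentions that allocation is expensive, the calloc call could presumably be replaced by taking buffers from a preallocated pool; that is an assumption about the asker's memory manager, not something the answer specifies.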

Hope this helps. Alex

+5

Use a synchronization mechanism. A mutex seems reasonable.

If you are concerned about scalability, try using a monitor.
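For illustration, a mutex-protected 64-byte copy could look like the sketch below. It assumes POSIX threads are available; the record layout and the record_write and record_read helpers are hypothetical, and both writers and readers must go through the same mutex for the copy to appear atomic.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define RECORD_SIZE 64

/* Hypothetical record type: the mutex guards every access to data. */
struct record {
    pthread_mutex_t lock;
    char data[RECORD_SIZE];
};

/* Copy n bytes into the record while holding the mutex. */
static void record_write(struct record *r, const char *src, size_t n)
{
    assert(n <= RECORD_SIZE);
    pthread_mutex_lock(&r->lock);
    memcpy(r->data, src, n);
    pthread_mutex_unlock(&r->lock);
}

/* Copy the whole record out under the same mutex, so a reader never
   observes a partially written buffer. */
static void record_read(struct record *r, char *dst)
{
    pthread_mutex_lock(&r->lock);
    memcpy(dst, r->data, RECORD_SIZE);
    pthread_mutex_unlock(&r->lock);
}
```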

+2

Late, but for others coming to this question: the following is simpler, faster, and has less impact on the cache.

Note: I changed CAS to the corresponding GCC atomic builtin. There is no need for volatile; the CAS introduces a memory barrier.

    // Simpler structure
    struct record {
        int spin = 0;
        char data[64];
    };

    struct record *node;

    while (node->spin || !__sync_bool_compare_and_swap(&node->spin, 0, 1))
        ;   // spin
    memcpy(destination, source, NoOfBytes);
    node->spin = 0;

PS: I am not sure whether using CAS instead of node->spin = 0 would improve performance a bit more.
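On the PS, one possible variant is sketched below using C11 <stdatomic.h> rather than the GCC __sync builtins (a substitution made here, not the answer's code). It unlocks with a plain release store instead of a CAS, which is sufficient because only the lock holder ever writes 0. The helper names record_lock, record_unlock and record_copy_in are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define RECORD_SIZE 64

struct record {
    atomic_int spin;          /* 0 = free, 1 = held */
    char data[RECORD_SIZE];
};

static void record_lock(struct record *r)
{
    for (;;) {
        /* Test first with a cheap load so waiting threads do not keep
           issuing read-modify-write operations on the cache line. */
        while (atomic_load_explicit(&r->spin, memory_order_relaxed) != 0)
            ;
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(&r->spin, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;
    }
}

static void record_unlock(struct record *r)
{
    /* Release store: the copied bytes are visible to the next lock holder. */
    atomic_store_explicit(&r->spin, 0, memory_order_release);
}

/* Copy n bytes into the record while holding the spinlock. */
static void record_copy_in(struct record *r, const char *src, size_t n)
{
    assert(n <= RECORD_SIZE);
    record_lock(r);
    memcpy(r->data, src, n);
    record_unlock(r);
}
```

Because the spin field is an atomic type, the compiler cannot hoist the load out of the waiting loop, so volatile is not needed here either.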

+2

Do not use a lock; use a CriticalSection. Locks are heavy, whereas CriticalSections are extremely fast (just a few instructions, depending on the platform). You did not specify the operating system; what I describe here comes from experience on Windows, although other OSes should be similar.

Are you worried that CriticalSections might not scale well enough for your purpose if they contain a lot of code? The main reason (and probably the argument you read somewhere) is that a CriticalSection cannot interleave multiple threads at a fine enough grain if threads stay inside the CS for a long time. You can avoid this by wrapping in the CS only the part of your code that really needs to be atomic. On the other hand, if you use CSs at too fine a grain, then of course the percentage overhead will increase. This is a trade-off that you cannot avoid with any synchronization mechanism.

You say the operation that needs to be atomic is a copy of 64 bytes: in that case, the synchronization overhead of a CS will be negligible. Just give it a try. By experimenting with the granularity at which you synchronize (around one 64-byte copy, or around four such copies), you can balance how finely the threads interleave against the percentage overhead. But in general, a CS is fast enough and quite scalable.

+1

Source: https://habr.com/ru/post/892814/

