Thousands of reader/writer locks in a single process

I am currently developing a cross-platform (Linux/Windows) C++ server application with a large-scale synchronization scheme. Internally, I use boost::thread as an abstraction over OS-specific threads. My problem is to protect an array of data, with each element of the array protected by an independent reader/writer lock.

My array contains 4096 elements. Considering the writer-priority solution to the readers-writers problem presented in "The Little Book of Semaphores" (p. 85), my application would need 5 semaphores per array element. That gives a total of about 20,000 semaphores (or, equivalently, 20,000 mutexes + 20,000 condition variables).

An additional characteristic of my application is that, at any given time, most of the semaphores are inactive (typically about 32 "client" threads are waiting/signaling on thousands of semaphores). Note that since the whole server runs in a single process, I use lightweight, thread-level semaphores (rather than interprocess semaphores).
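
To make the setup concrete, here is a minimal sketch of the design described above, using boost::shared_mutex as the per-element reader/writer lock (the element type and names are placeholders, not the actual code):

```cpp
#include <boost/thread/locks.hpp>
#include <boost/thread/shared_mutex.hpp>
#include <array>
#include <cstddef>

// Sketch of the design in question: one independent reader/writer lock
// per array element, i.e. 4096 locks in total.
struct Slot {
    boost::shared_mutex rw;   // per-element read/write lock
    int                 data; // placeholder for the protected payload
};

std::array<Slot, 4096> table;

int read_element(std::size_t i) {
    boost::shared_lock<boost::shared_mutex> guard(table[i].rw);  // reader
    return table[i].data;
}

void write_element(std::size_t i, int value) {
    boost::unique_lock<boost::shared_mutex> guard(table[i].rw);  // writer
    table[i].data = value;
}
```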

My question is twofold:

  • Is it advisable to create a total of 20,000 semaphores on Linux and on Windows for a single process? Frankly, I suspect it is not...

  • If this practice is not recommended, what technique can I use to reduce the number of actual semaphores, e.g. to create a set of N "emulated semaphores" on top of 1 actual semaphore? I expect this to be an interesting approach, because most of my semaphores are inactive at any given time.

Thanks in advance!

Summary of answers so far

  • Using thousands of semaphores is not recommended, especially from a cross-platform perspective. This holds even though they are not interprocess semaphores (they still consume handles on Windows).
  • A straightforward way to solve my problem is to split my array into, say, 64 sub-arrays of 16 elements and associate each of these sub-arrays with a single reader/writer lock. Unfortunately, this introduces a lot of false contention (one writer blocks reads of 15 other elements).
  • Digging into the Boost source code, I found that:

    • the "boost::mutex" implementation does not wrap CRITICAL_SECTION objects on Windows (it uses CreateEvent and ReadWriteBarrier instead),
    • "boost::shared_mutex" uses CreateSemaphore on Windows (which creates heavyweight, interprocess-capable objects), and
    • "boost::shared_mutex" does not wrap "pthread_rwlock_t" on Linux.

    The reasons for these choices are unclear to me. In particular, using interprocess-capable objects for "boost::shared_mutex" on Windows seems suboptimal.

Summary of open questions so far

  • How would you implement a set of N "emulated semaphores" on top of 1 actual semaphore, keeping the contention between the emulated semaphores as small as possible? (A rough sketch of one possible approach appears after this list.)
  • How do boost::mutex and boost::shared_mutex compare with their native counterparts (CRITICAL_SECTION and pthread_rwlock_t)?
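
For the first open question, one possible shape is to multiplex N logical semaphores over a single mutex/condition-variable pair, as in the following sketch (the class name and interface are mine, and standard C++ primitives stand in for the boost equivalents). Its weakness is exactly the contention asked about: every signal wakes every waiter, so in practice one would shard the counters across several such sets:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch: N "emulated semaphores" sharing one mutex and one
// condition variable. signal() wakes every waiter, not just those waiting
// on index i -- this is the contention the open question asks to minimize.
class EmulatedSemaphoreSet {
public:
    explicit EmulatedSemaphoreSet(std::size_t n) : counts_(n, 0) {}

    void wait(std::size_t i) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return counts_[i] > 0; });
        --counts_[i];
    }

    void signal(std::size_t i) {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++counts_[i];
        }
        cv_.notify_all();  // broadcast: all emulated semaphores share the CV
    }

private:
    std::mutex              m_;
    std::condition_variable cv_;
    std::vector<int>        counts_;
};
```
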
3 answers
  • This is not recommended. You should not do it because, on Windows, each semaphore consumes one handle, and a process can only manage a limited number of handles. Threads, processes and other Windows objects also need handles and may fail if none are available. On Linux, this is similar to the file-descriptor concept.

  • Divide the 4096 elements into, say, 30 sets of about 140 elements each and assign a single semaphore to each 140-element group. Then the 30 (in this example) threads accessing those sets will synchronize on each group's semaphore. A sketch of this grouping idea is shown below.
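
A minimal sketch of this grouping (lock striping) idea, shown with reader/writer locks as in the question's own summary; the group size is illustrative:

```cpp
#include <boost/thread/locks.hpp>
#include <boost/thread/shared_mutex.hpp>
#include <array>
#include <cstddef>

// Hypothetical lock-striping sketch: contiguous groups of elements share one
// reader/writer lock. kGroupSize is illustrative (~140 per the answer above;
// smaller groups mean more locks but less false contention).
constexpr std::size_t kElements  = 4096;
constexpr std::size_t kGroupSize = 140;
constexpr std::size_t kGroups    = (kElements + kGroupSize - 1) / kGroupSize;

std::array<int, kElements>               elements;
std::array<boost::shared_mutex, kGroups> group_locks;

boost::shared_mutex& lock_for(std::size_t i) {
    return group_locks[i / kGroupSize];  // element index -> group lock
}

int read_element(std::size_t i) {
    boost::shared_lock<boost::shared_mutex> guard(lock_for(i));
    return elements[i];
}

void write_element(std::size_t i, int value) {
    boost::unique_lock<boost::shared_mutex> guard(lock_for(i));
    elements[i] = value;
}
```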


I will tell you what I think of this from a Windows perspective; I am very experienced in writing server applications for Windows.

First, there is absolutely no problem with creating 20k semaphores for a single process. They are fairly lightweight kernel objects, even the "interprocess" ones.

I do see, however, another problem with your design. You should be aware that every operation you perform on a kernel object (such as a semaphore or mutex) involves a heavy kernel-mode transaction (a.k.a. a system call). Each such call may cost you around 2,000 CPU cycles, even if there is no contention at all.

So you may find yourself in a situation where most of the CPU time is spent just calling synchronization methods.

Alternatively, you can use interlocked operations to synchronize threads. They cost much less (usually tens of CPU cycles).
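
For illustration, here is the kind of primitive meant by "interlocked operations": a tiny spinlock that never enters the kernel (std::atomic_flag is used here as a portable stand-in for the Win32 Interlocked* functions). It is only appropriate for very short critical sections:

```cpp
#include <atomic>

// Illustration of interlocked (atomic) synchronization: a spinlock that
// never enters the kernel. On Windows, InterlockedExchange /
// InterlockedCompareExchange would play the role of the atomic flag.
// Suitable only when the protected section is a handful of instructions.
class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy-wait until the previous holder clears the flag
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```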

There is also an object called a critical section. It is a kind of hybrid of an interlocked operation and a kernel object (the kernel object is used only when there is an actual collision). You should check how long you typically hold the locks on your elements. If the locks are usually short, just use critical sections and forget about sophisticated reader/writer locks.
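
As an example of what "just use critical sections" could look like per element or per group, here is a minimal RAII wrapper around a Win32 CRITICAL_SECTION (a sketch, not the answerer's actual code):

```cpp
#include <windows.h>

// Minimal RAII wrapper around a Win32 CRITICAL_SECTION. It spins briefly in
// user mode and only falls back to a kernel object under real contention,
// so holding one per element (or per group) is cheap for short locks.
class CriticalSection {
public:
    CriticalSection()  { InitializeCriticalSection(&cs_); }
    ~CriticalSection() { DeleteCriticalSection(&cs_); }

    CriticalSection(const CriticalSection&) = delete;
    CriticalSection& operator=(const CriticalSection&) = delete;

    void lock()   { EnterCriticalSection(&cs_); }
    void unlock() { LeaveCriticalSection(&cs_); }

private:
    CRITICAL_SECTION cs_;
};
```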

If you are still dealing with long-held locks, you really do need reader/writer semantics, and you see that you are spending a lot of CPU on kernel-mode transactions, consider building your own (or finding an existing) hybrid implementation of such a lock.


On Linux, you should definitely not implement the locks yourself; use pthread_rwlock_t instead.

Having an array of 4096 such elements should not pose any particular problem. POSIX lock structures are implemented quite efficiently on Linux. In particular, they use atomic operations on the "fast path" whenever possible and only make system calls (notably futex) when there is contention on that particular lock. So, provided you are reasonably careful and each thread holds only one or two locks at a time, the limit on Linux will come from the total number of worker threads, not from the number of lock objects themselves.
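
A minimal sketch of this approach, with one pthread_rwlock_t per element (error handling omitted, element type arbitrary):

```cpp
#include <pthread.h>
#include <array>
#include <cstddef>

// Sketch: one pthread_rwlock_t per element, initialized once at startup.
// glibc keeps the uncontended path in user space (atomics) and only issues
// futex system calls when a lock is actually contended.
struct Slot {
    pthread_rwlock_t rw;
    int              data;  // placeholder payload
};

std::array<Slot, 4096> table;

void init_locks() {
    for (Slot& s : table)
        pthread_rwlock_init(&s.rw, 0);  // default attributes
}

int read_element(std::size_t i) {
    pthread_rwlock_rdlock(&table[i].rw);  // shared (reader) lock
    int value = table[i].data;
    pthread_rwlock_unlock(&table[i].rw);
    return value;
}

void write_element(std::size_t i, int value) {
    pthread_rwlock_wrlock(&table[i].rw);  // exclusive (writer) lock
    table[i].data = value;
    pthread_rwlock_unlock(&table[i].rw);
}
```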


Source: https://habr.com/ru/post/894432/

