If you look at the "space overhead" of any kind of synchronization primitive, keep in mind that they cannot be packed too tightly. This is because, for example, two mutexes sharing a cache line will end up cache thrashing (false sharing) if they are used concurrently, even if the users acquiring these locks never "conflict". That is, imagine two threads running two loops:
    for (;;) { lock(lockA); unlock(lockA); }

and

    for (;;) { lock(lockB); unlock(lockB); }
You will see twice as many iterations in total for the two threads, compared to a single thread running one such loop, if and only if the two locks are not on the same cache line. If lockA and lockB are on the same cache line, the number of iterations per thread will halve, because the cache line holding these two locks will constantly bounce between the CPU cores executing the two threads.
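To make this concrete, below is a minimal sketch of that experiment in C++ (assuming a 64-byte cache line; the Spinlock type, iteration count, and timing harness are illustrative additions, not part of the original):

    // Two threads each lock/unlock their own spinlock; the only variable
    // is whether the two locks share a cache line.
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct Spinlock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
        void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
        void unlock() { flag.clear(std::memory_order_release); }
    };

    struct Packed {                  // both locks packed into one line
        Spinlock a, b;
    };

    struct Padded {                  // each lock on its own 64-byte line
        alignas(64) Spinlock a;
        alignas(64) Spinlock b;
    };

    template <typename Locks>
    double run(Locks& locks) {
        constexpr long kIters = 10000000;
        auto worker = [](Spinlock& l) {
            for (long i = 0; i < kIters; ++i) { l.lock(); l.unlock(); }
        };
        auto t0 = std::chrono::steady_clock::now();
        std::thread t1(worker, std::ref(locks.a));
        std::thread t2(worker, std::ref(locks.b));
        t1.join(); t2.join();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        Packed packed;
        Padded padded;
        std::printf("same cache line:      %.2f s\n", run(packed));
        std::printf("separate cache lines: %.2f s\n", run(padded));
    }

On typical multi-core hardware the Packed variant should run noticeably slower, since the line holding both flags migrates between cores on every test_and_set.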
Therefore, although the actual data size of the primitive type underlying a spinlock or mutex may be only a byte or a 32-bit word, the effective size of such an object is often larger.
Keep this in mind before claiming "my mutexes are too large". In fact, on x86/x64, 40 bytes is too small to prevent false sharing, since cache lines there are currently at least 64 bytes.
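One way to make that effective size explicit is to pad the primitive out to a whole cache line yourself; a sketch, with 64 as an assumed line size (C++17 also offers std::hardware_destructive_interference_size for this):

    #include <mutex>

    struct alignas(64) PaddedMutex {
        std::mutex m;   // the mutex itself may be ~40 bytes on x86-64/glibc
        // alignas(64) rounds the struct's size and alignment up to a full
        // cache line, so adjacent PaddedMutex instances never share one.
    };

    static_assert(sizeof(PaddedMutex) % 64 == 0,
                  "each instance occupies whole cache lines");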
In addition, if you are very concerned about memory usage, consider that notification objects need not be unique: condition variables can serve to signal different events (via the predicate that boost::condition_variable knows about). So a single mutex/CV pair could be used for an entire state machine instead of one such pair per state. The same goes, for example, for thread pool synchronization: having more locks than threads is not necessarily beneficial.
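Here is a sketch of the one-pair-per-state-machine idea, using std::condition_variable (which has the same predicate-wait interface as boost::condition_variable); the State enum and member names are made up for illustration:

    #include <condition_variable>
    #include <mutex>

    enum class State { Idle, Running, Stopped };

    struct Machine {
        std::mutex m;
        std::condition_variable cv;   // one CV serves every transition
        State state = State::Idle;

        void set(State s) {
            { std::lock_guard<std::mutex> g(m); state = s; }
            cv.notify_all();          // wake everyone; predicates filter
        }
        // Each waiter distinguishes "its" event via the predicate,
        // not via a dedicated CV per state.
        void wait_for(State s) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return state == s; });
        }
    };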
Edit: For a few more references on "false sharing" (and the negative performance impact caused by placing several atomically-updated variables on the same cache line), see (among others) the following SO postings:
As said, when using multiple "synchronization objects" (be it atomically-updated variables, locks, semaphores, ...) in a multi-core, cache-per-core configuration, allow each of them a separate cache line of space. You are trading memory usage for scalability here, but really, if your software reaches a point where it requires several million locks (making that GBs of memory), you either have the funding for a few hundred GB of memory (and a hundred CPU cores), or you are doing something wrong in your software.
In most cases (a lock / an atomic for a specific instance of a class / struct) you get the padding for free, as long as the object instance containing the atomic variable is large enough.
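For illustration, a hypothetical instance type where the padding comes for free; the field names and sizes are invented, the point is only that the per-instance data already fills out the rest of the line:

    #include <atomic>
    #include <cstdint>

    struct alignas(64) Connection {
        std::atomic<uint32_t> refcount{0};  // the synchronization word
        uint64_t bytes_sent = 0;            // per-instance data fills the
        uint64_t bytes_received = 0;        // rest of the line, so two
        char     peer_name[40] = {};        // Connections never false-share
    };

    static_assert(sizeof(Connection) == 64, "one instance per cache line");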