Boost: thread synchronization data structure sizes on the ridiculous side?

Compiler: clang++, x86-64 on Linux.

It has been a while since I wrote any complicated low-level system code, and I usually program against the system primitives (Windows and pthreads/POSIX), so the ins and outs have slipped from my memory. I am working with boost::asio and boost::thread at the moment.

To emulate a synchronous RPC against an asynchronous function executor (a boost::asio::io_service with multiple threads calling io_service::run(), to which requests are io_service::post()'ed), I am using Boost synchronization primitives. Out of curiosity, I decided to take sizeof of the primitives. This is what I see.

    struct notification_object
    {
        bool ready;
        boost::mutex m;
        boost::condition_variable v;
    };
    ...
    std::cout << sizeof(bool) << std::endl;
    std::cout << sizeof(boost::mutex) << std::endl;
    std::cout << sizeof(boost::condition_variable) << std::endl;
    std::cout << sizeof(notification_object) << std::endl;
    ...

Output:

    1
    40
    88
    136

Forty bytes for a mutex? And 88 for a condition_variable?! Please keep in mind that I am repulsed by this bloated size because I am thinking of an application that could create hundreds of notification_object instances.

This level of overhead for portability seems ridiculous; can anyone justify it? As far as I remember, these primitives should be 4 or 8 bytes wide depending on the memory model of the CPU.
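
For context, here is a minimal sketch of the emulation described above (assuming C++11; the function name do_request is mine, not part of the original code):

    // Hypothetical sketch: emulating a synchronous call on top of the
    // asynchronous executor, using the notification_object defined above.
    #include <boost/asio.hpp>
    #include <boost/thread.hpp>

    void do_request(boost::asio::io_service& io)
    {
        notification_object n;
        n.ready = false;
        io.post([&n] {
            // ... perform the actual work on one of the pool threads ...
            boost::lock_guard<boost::mutex> lk(n.m);
            n.ready = true;
            n.v.notify_one();
        });
        // The caller blocks until the posted handler signals completion.
        boost::unique_lock<boost::mutex> lk(n.m);
        while (!n.ready)
            n.v.wait(lk);
    }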

+3
source share
4 answers

If you look at the "size overhead" of any kind of synchronization primitive, keep in mind that these cannot be packed too closely. That is because, for example, two mutexes sharing a cache line would end up cache-thrashing (false sharing) if they are in use at the same time, even if the users acquiring those locks never "conflict". That is, imagine two threads running these two loops:

    for (;;) {
        lock(lockA);
        unlock(lockA);
    }

and

    for (;;) {
        lock(lockB);
        unlock(lockB);
    }

You will see twice the number of iterations for the two threads, compared to one thread running one such loop, if and only if the two locks are not within the same cache line. If lockA and lockB are in the same cache line, the number of iterations per thread will halve, because the cache line containing those two locks will constantly bounce between the CPU cores executing the two threads.

So even though the actual data size of the primitive type underlying a spinlock or mutex might be only a byte or a 32-bit word, the effective data size of such an object is often larger.

Keep that in mind before asserting "my mutexes are too large". In fact, on x86/x64, 40 bytes is too small to prevent false sharing, since cache lines there are currently at least 64 bytes.
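
To illustrate, one common way to guard against false sharing is to pad each lock to a full cache line. A minimal sketch, assuming 64-byte cache lines and C++11 (std::atomic_flag used as a stand-in spinlock; not from the original answer):

    #include <atomic>

    // alignas(64) rounds the struct up to one 64-byte cache line, so two
    // adjacent padded_spinlock objects can never share a line.
    struct alignas(64) padded_spinlock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
        void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { } }
        void unlock() { flag.clear(std::memory_order_release); }
    };

    static_assert(sizeof(padded_spinlock) == 64, "one lock per cache line");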

Beyond that, if you are highly concerned about memory usage, consider that notification objects do not have to be unique: condition variables can serve to signal different events (via the predicate that boost::condition_variable knows about). It would therefore be possible to use a single mutex/CV pair for a whole state machine instead of one such pair per state, as sketched below. The same goes for, e.g., thread pool synchronization: having more locks than threads is not necessarily beneficial.
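
A minimal sketch of that idea (assuming C++11; the state names are mine):

    #include <boost/thread/mutex.hpp>
    #include <boost/thread/condition_variable.hpp>

    // One mutex/CV pair signals every state change; waiters distinguish
    // events through their predicates rather than through separate CVs.
    struct state_machine {
        boost::mutex m;
        boost::condition_variable cv;
        enum state { idle, working, done };
        state s;

        state_machine() : s(idle) { }

        void set(state next) {
            boost::lock_guard<boost::mutex> lk(m);
            s = next;
            cv.notify_all();   // wake all waiters; each rechecks its predicate
        }

        void wait_for(state wanted) {
            boost::unique_lock<boost::mutex> lk(m);
            cv.wait(lk, [&] { return s == wanted; });
        }
    };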

Edit: For a few more references on false sharing (and the negative performance impact caused by co-locating several atomically-updated variables in the same cache line), see (among others) the following SO postings:

As said there, when using multiple "synchronization objects" (whether atomically-updated variables, locks, semaphores, ...) in a multi-core, cache-per-core configuration, give each of them a separate cache line of space. You are trading memory for scalability here, but really: if your software gets into a region where it requires several million locks (making that GBs of memory), you either have the funding for a few hundred GB of memory (and a hundred CPU cores), or you are doing something wrong in your software design.

In most cases (a lock / an atomic for a specific instance of a class/struct), you get the padding for free, as long as the object instance containing the atomic variable is large enough.
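
For instance (a sketch under the assumption of 64-byte cache lines; the field names are illustrative):

    #include <atomic>

    // The 4-byte lock word rides along inside an object that already
    // occupies a whole cache line, so the "padding" costs nothing extra.
    struct record {
        std::atomic<int> lock;   // the actual synchronization state
        char payload[60];        // the object's own data fills the line
    };

    static_assert(sizeof(record) == 64, "lock padded for free by the payload");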

+23
source

On my 64-bit Ubuntu box:

    #include <pthread.h>
    #include <stdio.h>

    int main()
    {
        printf("sizeof(pthread_mutex_t)=%ld\n", sizeof(pthread_mutex_t));
        printf("sizeof(pthread_cond_t)=%ld\n", sizeof(pthread_cond_t));
        return 0;
    }

prints

    sizeof(pthread_mutex_t)=40
    sizeof(pthread_cond_t)=48

This means that your statement that

This level of overhead for portability seems ridiculous, can someone justify it to me? As far as I remember, these primitives should be 4 or 8 bytes wide depending on the memory model of the CPU.

is quite simply wrong.

If you are wondering where the extra 40 bytes of boost::condition_variable come from: the Boost class uses an internal mutex.

In a nutshell, on this platform boost::mutex has exactly zero overhead compared to pthread_mutex_t, and boost::condition_variable carries the overhead of that additional internal mutex. Whether or not the latter is acceptable for your application is for you to decide.

P.S. I would encourage you to stick to the facts and avoid inflammatory language in your posts. I, for one, very nearly decided to ignore yours solely because of its tone.

+19
source

Looking at the implementation:

    class mutex : private noncopyable
    {
    public:
        friend class detail::thread::lock_ops<mutex>;

        typedef detail::thread::scoped_lock<mutex> scoped_lock;

        mutex();
        ~mutex();

    private:
    #if defined(BOOST_HAS_WINTHREADS)
        typedef void* cv_state;
    #elif defined(BOOST_HAS_PTHREADS)
        struct cv_state
        {
            pthread_mutex_t* pmutex;
        };
    #elif defined(BOOST_HAS_MPTASKS)
        struct cv_state
        {
        };
    #endif
        void do_lock();
        void do_unlock();
        void do_lock(cv_state& state);
        void do_unlock(cv_state& state);

    #if defined(BOOST_HAS_WINTHREADS)
        void* m_mutex;
    #elif defined(BOOST_HAS_PTHREADS)
        pthread_mutex_t m_mutex;
    #elif defined(BOOST_HAS_MPTASKS)
        threads::mac::detail::scoped_critical_region m_mutex;
        threads::mac::detail::scoped_critical_region m_mutex_mutex;
    #endif
    };

Now let me strip out the non-data parts and reorder:

    class mutex : private noncopyable
    {
    private:
    #if defined(BOOST_HAS_WINTHREADS)
        void* m_mutex;
    #elif defined(BOOST_HAS_PTHREADS)
        pthread_mutex_t m_mutex;
    #elif defined(BOOST_HAS_MPTASKS)
        threads::mac::detail::scoped_critical_region m_mutex;
        threads::mac::detail::scoped_critical_region m_mutex_mutex;
    #endif
    };

So apart from noncopyable, I don't see much overhead that doesn't already come with the system mutexes.

+6
source

Sorry, I am answering here, but I don't have enough reputation to add comments.

@FrankH, cache-line handling is not a good excuse for making the data structure larger. There are cache lines as big as 128 bytes; that does not mean a mutex has to be that large.

I think programmers should instead be warned to keep synchronization objects apart in memory so that they do not share a cache line. That can be achieved by embedding the object in a sufficiently large data structure, without inflating the synchronization structure itself with unused bytes. On the other hand, adding unused bytes can make a program slower, because the CPU has to fetch more cache lines to access the same structure.

@Hassan Syed, I don't think mutexes were programmed with this kind of cache optimization in mind. Rather, I think they are built this way to support features such as priority inheritance, recursive locking, and so on. As a suggestion, if you need a large number of mutexes in your program, consider something like a pool (array) of mutexes and store only an index in your nodes (taking care of cache-line sharing, of course; see the sketch below). I leave the details of this solution to you.
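
A minimal sketch of that suggestion (the names are mine; pool size and indexing strategy are up to you):

    #include <boost/thread/mutex.hpp>
    #include <cstddef>

    // A fixed pool of mutexes; nodes store (or compute) a small index
    // instead of each embedding a 40-byte mutex of its own.
    class mutex_pool {
        static const std::size_t N = 64;   // tune to core count / contention
        boost::mutex locks_[N];
    public:
        boost::mutex& for_key(std::size_t key) { return locks_[key % N]; }
    };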

+3
source

Source: https://habr.com/ru/post/945601/

