Summary of the problem, in case anyone runs into something similar: after some discussion with Mathworks support, this turned out to be a conflict between the system Boost and the Boost libraries shipped with Matlab. When I compiled with the system Boost headers and linked against the (older) Matlab Boost library, it segfaulted. When I compiled and dynamically linked against the system Boost, but the Matlab Boost library was then dynamically loaded instead, it hung forever.
Statically linking to the system Boost works, as does downloading the correct headers for the version of Boost that Matlab ships and compiling with those. Of course, the Mac builds of Matlab do not have version numbers in their file names, although the Linux and, presumably, Windows builds do. For reference, R2011b uses Boost 1.44.
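Since the fix depends on compiling against headers that match the Boost that Matlab loads, a compile-time guard on `BOOST_VERSION` can catch a mismatch early. The following is only a sketch of that idea: `MATLAB_BOOST_VERSION` is a hypothetical macro you would define yourself (for example `-DMATLAB_BOOST_VERSION=104400` for the 1.44 that R2011b uses), and `boostMajor`, `boostMinor` and `boostPatch` are illustrative helper names. The `__has_include` guard is there only so the snippet also compiles where Boost headers are absent.

```cpp
#include <cassert>

#if defined(__has_include)
#  if __has_include(<boost/version.hpp>)
#    include <boost/version.hpp>  // defines BOOST_VERSION, e.g. 104400 for 1.44.0
#  endif
#endif

// BOOST_VERSION encodes major * 100000 + minor * 100 + patch.
int boostMajor(int v) { assert(v >= 0); return v / 100000; }
int boostMinor(int v) { assert(v >= 0); return v / 100 % 1000; }
int boostPatch(int v) { assert(v >= 0); return v % 100; }

// Opt-in guard: only active when the (hypothetical) MATLAB_BOOST_VERSION
// macro is supplied on the command line. It compares major.minor and
// ignores the patch level.
#if defined(BOOST_VERSION) && defined(MATLAB_BOOST_VERSION)
#  if BOOST_VERSION / 100 != MATLAB_BOOST_VERSION / 100
#    error "Boost headers do not match the Boost version shipped with Matlab"
#  endif
#endif
```

The helpers can also be called on `BOOST_VERSION` at run time (for instance from `mexFunction`) to print which headers a mex file was built against.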
I have multi-threaded code that works fine when it is compiled directly, but segfaults and/or deadlocks when called from the Matlab mex interface. I don't know whether the different environment is exposing a flaw in my code or what, but I can't figure it out.
I'm running this on three machine configurations (although there are several of the CentOS boxes):
- OSX 10.7, g++ 4.2, boost 1.48, Matlab R2011a (clang++ 2.1 also works standalone; I haven't tried to get mex to use clang)
- ancient CentOS, g++ 4.1.2, boost 1.33.1 (debug and non-debug), Matlab R2010b
- ancient CentOS, g++ 4.1.2, boost 1.40 (no debug versions installed), Matlab R2010b
Here is a version of the code that shows this behavior.
    #include <cstdio>   // printf
    #include <cstdlib>  // rand
    #include <queue>
    #include <vector>

    #include <boost/thread.hpp>
    #include <boost/utility.hpp>

    #ifndef NO_MEX
    #include "mex.h"
    #endif

    class Worker : boost::noncopyable {
        boost::mutex &jobs_mutex;
        std::queue<size_t> &jobs;

        boost::mutex &results_mutex;
        std::vector<double> &results;

    public:
        Worker(boost::mutex &jobs_mutex, std::queue<size_t> &jobs,
               boost::mutex &results_mutex, std::vector<double> &results)
            : jobs_mutex(jobs_mutex), jobs(jobs),
              results_mutex(results_mutex), results(results)
        {}

        void operator()() {
            size_t i;
            float r;

            while (true) {
                // get a job
                {
                    boost::mutex::scoped_lock lk(jobs_mutex);
                    if (jobs.size() == 0)
                        return;

                    i = jobs.front();
                    jobs.pop();
                }

                // do some "work"
                r = rand() / 315.612;

                // write the results
                {
                    boost::mutex::scoped_lock lk(results_mutex);
                    results[i] = r;
                }
            }
        }
    };

    std::vector<double> doWork(size_t n) {
        std::vector<double> results;
        results.resize(n);

        boost::mutex jobs_mutex, results_mutex;

        std::queue<size_t> jobs;
        for (size_t i = 0; i < n; i++)
            jobs.push(i);

        Worker w1(jobs_mutex, jobs, results_mutex, results);
        boost::thread t1(boost::ref(w1));

        Worker w2(jobs_mutex, jobs, results_mutex, results);
        boost::thread t2(boost::ref(w2));

        t1.join();
        t2.join();

        return results;
    }

    #ifdef NO_MEX
    int main() {
    #else
    void mexFunction(int nlhs, mxArray **plhs, int nrhs, const mxArray **prhs) {
    #endif
        std::vector<double> results = doWork(10);
        for (size_t i = 0; i < results.size(); i++)
            printf("%g ", results[i]);
        printf("\n");
    }
Note that on boost 1.48 I get the same behavior if I change the functor to a standard function and just pass boost::ref wrappers around the mutexes/data as additional arguments to boost::thread. Boost 1.33.1 does not support this, though.
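For illustration, the function-based variant mentioned above might look roughly like the sketch below. To keep it buildable without Boost, it swaps in C++11 `std::thread`, `std::mutex` and `std::ref` for their `boost::` counterparts, which is an assumption (the toolchains listed in this post predate C++11). `work` and `doWorkWithFunction` are illustrative names, and the deterministic `i * 0.5` stands in for the `rand()` call.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Free-function worker: the same job loop as Worker::operator(), but the
// shared state arrives as reference arguments instead of data members.
void work(std::mutex &jobs_mutex, std::queue<std::size_t> &jobs,
          std::mutex &results_mutex, std::vector<double> &results) {
    while (true) {
        std::size_t i;
        {   // get a job
            std::lock_guard<std::mutex> lk(jobs_mutex);
            if (jobs.empty()) return;
            i = jobs.front();
            jobs.pop();
        }
        double r = i * 0.5;  // deterministic stand-in for the "work"
        {   // write the result
            std::lock_guard<std::mutex> lk(results_mutex);
            assert(i < results.size());
            results[i] = r;
        }
    }
}

std::vector<double> doWorkWithFunction(std::size_t n) {
    std::vector<double> results(n);
    std::mutex jobs_mutex, results_mutex;
    std::queue<std::size_t> jobs;
    for (std::size_t i = 0; i < n; i++) jobs.push(i);

    // std::ref keeps the thread constructor from copying the arguments.
    std::thread t1(work, std::ref(jobs_mutex), std::ref(jobs),
                   std::ref(results_mutex), std::ref(results));
    std::thread t2(work, std::ref(jobs_mutex), std::ref(jobs),
                   std::ref(results_mutex), std::ref(results));
    t1.join();
    t2.join();
    return results;
}
```

On typical toolchains this needs the platform's thread flag when linking (for example `-pthread` with g++).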
When I compile it directly, it always runs fine; I have never seen it fail in any situation:
    $ g++ -o testing testing.cpp -lboost_thread-mt -DNO_MEX
    $ ./testing
    53.2521 895008 5.14128e+06 3.12074e+06 3.62505e+06 1.48984e+06 320100 4.61912e+06 4.62206e+06 6.35983e+06
Running from Matlab, I have seen many different behaviors after making various tweaks to the code, although none of the changes actually make sense to me. But here is what I see with the exact code above:
- On OSX / boost 1.48:
  - If it is linked against a release-variant boost, I get a segfault trying to access a near-0 address inside boost::thread::start_thread, called from the t1 constructor.
  - If it is linked against a debug-variant boost, it hangs forever in the first boost::thread::join. I'm not entirely sure, but I think the worker threads have actually completed at this point (I don't see anything in info threads that is obviously them).
- On CentOS / boost 1.33.1 and 1.40:
  - With release boost, I get a segfault in pthread_mutex_lock, called from boost::thread::join on t1.
  - With debug boost, it hangs forever in __lll_lock_wait inside pthread_mutex_lock in the same place. As shown below, the worker threads have completed at this point.
I don't know how to do anything more with the segfaults, since they never occur when I have debugging symbols that could actually tell me what the null pointer is.
In the hang-forever case, I always seem to get something like this when stepping through in GDB:
    99      Worker w1(jobs_mutex, jobs, results_mutex, results);
    (gdb)
    100     boost::thread t1(boost::ref(w1));
    (gdb)
    [New Thread 0x47814940 (LWP 19390)]
    102     Worker w2(jobs_mutex, jobs, results_mutex, results);
    (gdb)
    103     boost::thread t2(boost::ref(w2));
    (gdb)
    [Thread 0x47814940 (LWP 19390) exited]
    [New Thread 0x48215940 (LWP 19391)]
    [Thread 0x48215940 (LWP 19391) exited]
    105     t1.join();
It looks like both threads have finished before the call to t1.join(). So I tried adding a sleep(1) call in the "do some work" section between the locks; when I step through, the threads exit after the call to t1.join(), and it still hangs forever:
    106     t1.join();
    (gdb)
    [Thread 0x47814940 (LWP 20255) exited]
    [Thread 0x48215940 (LWP 20256) exited]
    # still hanging
If I go up to the doWork frame, results is populated with the same values that the standalone version prints on this machine, so it looks like all of the work is actually going through.
I have no idea what causes either the segfaults or the crazy hanging, or why it always works outside Matlab and never inside, or why it differs with and without debugging symbols, and I have no idea how to go about figuring this out. Any thoughts?
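Given the library conflict described in the summary at the top, one way to investigate this kind of failure is to list which shared libraries are actually loaded into the process and look for more than one Boost among them. The sketch below assumes Linux (`dl_iterate_phdr` from `<link.h>`) or OSX (`_dyld_image_count` and `_dyld_get_image_name` from `<mach-o/dyld.h>`); `loadedLibrariesMatching` and `collectName` are illustrative names, not part of Boost or the mex API.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // dl_iterate_phdr is a GNU extension in <link.h>
#endif
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <string>
#include <vector>

#if defined(__APPLE__)
#include <mach-o/dyld.h>
#elif defined(__linux__)
#include <link.h>

// Callback for dl_iterate_phdr: records the path of each loaded object.
static int collectName(struct dl_phdr_info *info, size_t, void *data) {
    std::vector<std::string> *names =
        static_cast<std::vector<std::string> *>(data);
    if (info->dlpi_name && info->dlpi_name[0] != '\0')
        names->push_back(info->dlpi_name);
    return 0;  // zero means "keep iterating"
}
#endif

// Returns the paths of all loaded images whose path contains `needle`.
// On other platforms this sketch simply returns an empty list.
std::vector<std::string> loadedLibrariesMatching(const std::string &needle) {
    assert(!needle.empty());
    std::vector<std::string> all;
#if defined(__APPLE__)
    for (uint32_t i = 0; i < _dyld_image_count(); i++) {
        const char *name = _dyld_get_image_name(i);
        if (name) all.push_back(name);
    }
#elif defined(__linux__)
    dl_iterate_phdr(collectName, &all);
#endif
    std::vector<std::string> matches;
    for (std::size_t i = 0; i < all.size(); i++)
        if (all[i].find(needle) != std::string::npos)
            matches.push_back(all[i]);
    return matches;
}
```

Calling `loadedLibrariesMatching("boost_thread")` from inside `mexFunction` and printing the result with `mexPrintf` would show whether the copy in the process comes from the system installation or from the Matlab directory.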
At @alanxz's suggestion, I ran the standalone version of the code under valgrind's memcheck, helgrind and DRD tools:
- On CentOS using valgrind 3.5, none of the tools give any non-suppressed errors.
- On OSX using valgrind 3.7:
  - Memcheck gives no non-suppressed errors.
  - Helgrind crashes for me when run on any binary (including, for example, valgrind --tool=helgrind ls) on OSX, complaining about an unsupported instruction.
  - DRD gives more than a hundred errors.
The DRD errors are quite inscrutable to me, and although I have read the manual and so on, I cannot make sense of them. Here is the first one, from a version of the code where I commented out the second worker/thread:
    Thread 2:
    Conflicting load by thread 2 at 0x0004b518 size 8
       at 0x3B837: void boost::call_once<void (*)()>(boost::once_flag&, void (*)()) (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
       by 0x2BCD4: boost::detail::set_current_thread_data(boost::detail::thread_data_base*) (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
       by 0x2BA62: thread_proxy (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
       by 0x2D88BE: _pthread_start (in /usr/lib/system/libsystem_c.dylib)
       by 0x2DBB74: thread_start (in /usr/lib/system/libsystem_c.dylib)
    Allocation context: Data section of r/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib
    Other segment start (thread 1)
       at 0x41B4DE: __bsdthread_create (in /usr/lib/system/libsystem_kernel.dylib)
       by 0x2B959: boost::thread::start_thread() (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
       by 0x100001B54: boost::thread::thread<boost::reference_wrapper<Worker> >(boost::reference_wrapper<Worker>, boost::disable_if<boost::is_convertible<boost::reference_wrapper<Worker>&, boost::detail::thread_move_t<boost::reference_wrapper<Worker> > >, boost::thread::dummy*>::type) (thread.hpp:204)
       by 0x100001434: boost::thread::thread<boost::reference_wrapper<Worker> >(boost::reference_wrapper<Worker>, boost::disable_if<boost::is_convertible<boost::reference_wrapper<Worker>&, boost::detail::thread_move_t<boost::reference_wrapper<Worker> > >, boost::thread::dummy*>::type) (thread.hpp:201)
       by 0x100000B50: doWork(unsigned long) (testing.cpp:66)
       by 0x100000CE1: main (testing.cpp:82)
    Other segment end (thread 1)
       at 0x41BBCA: __psynch_cvwait (in /usr/lib/system/libsystem_kernel.dylib)
       by 0x3C0C3: boost::condition_variable::wait(boost::unique_lock<boost::mutex>&) (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
       by 0x2D28A: boost::thread::join() (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
       by 0x100000B61: doWork(unsigned long) (testing.cpp:72)
       by 0x100000CE1: main (testing.cpp:82)
Line 66 is the construction of the thread, and 72 is the join call; there is nothing but comments in between. As far as I can tell, this says that there is a race between that part of the main thread and the initialization of the worker thread ... but I really don't understand how that is possible.
The rest of the output from DRD is here; I can't get anything out of it.