Debugging a strange error that depends on the selected scheduler

I am seeing strange behavior in the software I'm working on. It is a real-time controller written in C++, running on Linux, and it makes extensive use of multithreading.

When I run the program without requesting real-time scheduling, everything works as I expect. But when I ask it to switch to real-time mode, there is a reliably reproducible error that crashes the application. I think it must be some kind of deadlock, because a mutex acquisition times out and eventually triggers an assertion.

My question is how to track it down. Looking at the backtrace from the resulting core dump is not very helpful, since the cause of the problem lies somewhere in the past.

The following code switches between "normal" and "realtime" modes:

In main.cpp (simplified; return codes are checked with assertions):

    if (startAsRealtime) {
        struct sched_param sp;
        memset(&sp, 0, sizeof(sched_param));
        sp.sched_priority = 99;
        sched_setscheduler(getpid(), SCHED_RR, &sp);
    }

In each thread (simplified; return codes are checked with assertions):

    if (startAsRealtime) {
        sched_param param;
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_getschedparam(&attr, &param);
        param.sched_priority = priority;
        pthread_attr_setschedpolicy(&attr, SCHED_RR);
        pthread_attr_setschedparam(&attr, &param);
    }

Thanks in advance

2 answers

If you use glibc as your C library, you can use the answer to the question "Is it possible to list the mutexes that a thread holds" to find out which thread holds the mutex that is timing out. That should start to narrow things down: you can then examine that thread and find out why it is not releasing the mutex.
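
As a rough illustration of that idea: glibc's pthread_mutex_t records the kernel thread ID of the current owner in an internal field, __data.__owner. This is an implementation detail rather than a public API, so the sketch below is an assumption that holds only for glibc on Linux, and the read is unsynchronized, so treat the value as a diagnostic hint. The helper names current_tid, mutex_owner_tid and lock_or_report are hypothetical.

```cpp
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <ctime>

// Kernel thread ID of the caller (the LWP number shown by gdb and `ps -L`).
inline long current_tid() {
    return syscall(SYS_gettid);
}

// ASSUMPTION: relies on glibc's internal mutex layout.
// Returns the TID recorded as owner of `m`, 0 if it is unlocked,
// or -1 when not built against glibc.
inline long mutex_owner_tid(const pthread_mutex_t& m) {
#ifdef __GLIBC__
    return m.__data.__owner;
#else
    (void)m;
    return -1;
#endif
}

// Try to take `m` for up to `timeout_ms` (non-negative).
// On timeout, print which thread appears to hold it and return false.
inline bool lock_or_report(pthread_mutex_t& m, long timeout_ms) {
    timespec deadline;
    // pthread_mutex_timedlock expects an absolute CLOCK_REALTIME deadline.
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    int rc = pthread_mutex_timedlock(&m, &deadline);
    if (rc == ETIMEDOUT) {
        std::fprintf(stderr,
                     "mutex %p timed out: held by TID %ld, caller is TID %ld\n",
                     static_cast<void*>(&m), mutex_owner_tid(m), current_tid());
        return false;
    }
    return rc == 0;
}
```

The printed TID can be matched against the thread list in gdb (info threads) to see what the owning thread was doing.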


One of your real-time threads may be spinning in a loop (without yielding), thus starving the other threads and leading to the mutex timeout.
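
To illustrate the starvation point: under SCHED_RR or SCHED_FIFO, a thread that busy-waits is not preempted by lower-priority threads on the same CPU, so a loop should block between iterations instead of polling. Below is a minimal sketch, assuming a fixed-period loop; advance, run_periodic and the step callback are hypothetical names, not taken from the question's code.

```cpp
#include <cerrno>
#include <ctime>
#include <functional>

// Return `t` advanced by `ns` nanoseconds (ns >= 0),
// keeping tv_nsec in the range [0, 1e9).
inline timespec advance(timespec t, long ns) {
    t.tv_sec += ns / 1000000000L;
    t.tv_nsec += ns % 1000000000L;
    if (t.tv_nsec >= 1000000000L) {
        t.tv_sec += 1;
        t.tv_nsec -= 1000000000L;
    }
    return t;
}

// Call `step` `iterations` times, once per `period_ns`.
// Between iterations the thread blocks in clock_nanosleep, which lets
// lower-priority threads run instead of being starved by a spin loop.
inline void run_periodic(long period_ns, int iterations,
                         const std::function<void()>& step) {
    timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < iterations; ++i) {
        step();
        next = advance(next, period_ns);
        // An absolute deadline avoids drift; retry if a signal interrupts the sleep.
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, nullptr) == EINTR) {
        }
    }
}
```

If a loop in the controller has no blocking call of this kind (a sleep, a condition variable wait, or blocking I/O), it is a good candidate for the spinning thread.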

There may also be a race condition that only shows up when you switch to real time: the different timing of events under real-time scheduling may be what leads to the deadlock.

If you have places where you acquire several levels of locks, or lock recursively, those should be the first places you suspect.
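
One common way to rule out lock-order inversion where several locks are taken together is to acquire them in a single call to std::lock, which uses a deadlock-avoidance algorithm. This is only a sketch: Channel and move_value are hypothetical stand-ins for whatever shared state the controller actually protects.

```cpp
#include <mutex>

// Hypothetical piece of shared state guarded by its own mutex.
struct Channel {
    std::mutex m;
    int value = 0;
};

// Move `amount` from one channel to the other while holding both locks.
inline void move_value(Channel& from, Channel& to, int amount) {
    if (&from == &to) {
        return;  // passing the same mutex to std::lock twice is not allowed
    }
    // Acquires both mutexes without deadlocking, whatever order the
    // callers pass them in.
    std::lock(from.m, to.m);
    std::lock_guard<std::mutex> hold_from(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.m, std::adopt_lock);
    from.value -= amount;
    to.value += amount;
}
```

With this pattern, one thread calling move_value(a, b, n) and another calling move_value(b, a, n) cannot end up each holding one lock while waiting for the other.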

If you really don't know where the problem is, try a binary search to bracket it. Repeatedly cut out half of the functionality until you have narrowed it down to the actual problem. You may need to mock some of the subsystems that are temporarily cut out.

You can apply the same binary search technique to your mutex acquisition timeouts to find out which one is to blame.
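
Narrowing down the timeouts is easier if each one reports which mutex and which call site it came from. The wrapper below is a hypothetical sketch built on std::timed_mutex; the class name NamedMutex and the `where` label are assumptions made for illustration, not part of the original code.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <string>
#include <utility>

// A timed mutex that carries a name, so a timeout message says which
// lock was involved and where the attempt was made.
class NamedMutex {
public:
    explicit NamedMutex(std::string name) : name_(std::move(name)) {}

    // Try to acquire the mutex within `timeout`. On failure, log the
    // mutex name and the caller-supplied location, then return false.
    bool lock_for(std::chrono::milliseconds timeout, const char* where) {
        if (m_.try_lock_for(timeout)) {
            return true;
        }
        std::fprintf(stderr, "timeout on mutex '%s' at %s\n",
                     name_.c_str(), where);
        return false;
    }

    void unlock() { m_.unlock(); }

    const std::string& name() const { return name_; }

private:
    std::string name_;
    std::timed_mutex m_;
};
```

Swapping the mutexes over to such a wrapper one half at a time shows which group the failing acquisition belongs to.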


Source: https://habr.com/ru/post/888893/

