Tips for debugging hard-to-reproduce concurrency errors?

What are some tips for debugging hard-to-reproduce concurrency errors that occur, say, once in every few thousand test runs? I have one of these and no idea how to debug it. I can't put print statements or debugger watches everywhere to observe the internal state, because that would change the timing and would produce a huge amount of output in all the runs where the error is not reproduced.

+6
8 answers

Here is my method: I usually use a lot of assert() calls to check data consistency / validity as often as possible. When an assertion fails, the program crashes and generates a core file. Then I open the core file in a debugger to figure out which combination of thread states corrupted the data.
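A minimal sketch of the assert-heavy style, written in Python for brevity. The `Ledger` class and its invariant are hypothetical, not from the answer; the point is only that every entry and exit of a shared operation re-checks the data, so a corruption is reported close to where it happened.

```python
import threading


class Ledger:
    """Hypothetical shared structure: `total` must always equal sum(entries)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.entries: list[int] = []
        self.total = 0

    def check(self) -> None:
        # Fail fast and name the thread that observed the broken state.
        assert self.total == sum(self.entries), (
            f"ledger corrupt, seen by {threading.current_thread().name}: "
            f"total={self.total}, sum={sum(self.entries)}"
        )

    def add(self, amount: int) -> None:
        with self._lock:
            self.check()                 # state was consistent on entry
            self.entries.append(amount)
            self.total += amount
            self.check()                 # and is still consistent on exit


def run(workers: int = 4, per_worker: int = 500) -> Ledger:
    ledger = Ledger()

    def worker() -> None:
        for _ in range(per_worker):
            ledger.add(1)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    ledger.check()
    return ledger
```

In C or C++ a failed assert() calls abort(), which is what produces the core file (if core dumps are enabled). In Python a failed assertion raises AssertionError instead, and assertions are skipped entirely under `python -O`.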

+5

This may not help you, but it will probably help someone who sees this question in the future.

If you use a .NET language, you can use the CHESS project from Microsoft Research. It runs unit tests under every kind of thread interleaving and shows which ones cause the error.

There may be a similar tool for the language you are using.
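To make the idea of trying every interleaving concrete, here is a toy sketch in Python. It is not CHESS and does not use its API; it assumes you mark the possible preemption points by hand with `yield`, and the names `interleavings`, `increment` and `explore` are made up for this example.

```python
def interleavings(counts):
    """Yield every schedule as a tuple of thread indices.

    Thread i appears counts[i] times, once per step it has to execute.
    """
    if all(c == 0 for c in counts):
        yield ()
        return
    for i, c in enumerate(counts):
        if c:
            rest = counts[:i] + (c - 1,) + counts[i + 1:]
            for tail in interleavings(rest):
                yield (i,) + tail


def increment(state):
    """A racy read-modify-write, split where a preemption could occur."""
    tmp = state["x"]        # step 1: read
    yield
    state["x"] = tmp + 1    # step 2: write
    yield


def explore(thread_fns, steps, invariant):
    """Run every schedule from a fresh state; return the schedules that fail."""
    failing = []
    for schedule in interleavings(steps):
        state = {"x": 0}
        threads = [fn(state) for fn in thread_fns]
        for i in schedule:
            next(threads[i])            # advance thread i by one step
        if not invariant(state):
            failing.append(schedule)
    return failing


failing = explore([increment, increment], (2, 2), lambda s: s["x"] == 2)
```

For two threads of two steps each there are six schedules, and the four in which both reads happen before either write lose an update. Real tools control the actual scheduler rather than relying on hand-placed yield points.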

+2

It depends a lot on the nature of the problem. What usually helps is bisection (to narrow the search space) plus code "instrumentation": assertions on the IDs of the accessing threads, lock / unlock counts, locking order, etc. The hope is that the next time the problem is reproduced, the application will either log a detailed message or dump core, and that will lead you to a solution.

+1

One way to detect data corruption caused by a concurrency error:

  • Add an atomic counter associated with the data or buffer in question.
    • Leave all existing synchronization code as it is and do not modify it. The assumption is that you will fix the bug in the existing code, and the new atomic counter will be removed once the bug is fixed.
  • When you start modifying the data, increment the atomic counter. When you are done, decrement it.
  • Dump core as soon as you detect that the counter is greater than one (using something similar to InterlockedIncrement).
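The steps above can be sketched as follows. This is Python, which has no direct equivalent of InterlockedIncrement, so a tiny dedicated lock stands in for the atomic operation; the `OverlapDetector` name and the `buffer` being protected are assumptions for illustration.

```python
import threading


class OverlapDetector:
    """Counts writers inside a region; more than one means the real locking failed."""

    def __init__(self) -> None:
        self._guard = threading.Lock()   # stand-in for an atomic increment
        self.inside = 0

    def __enter__(self) -> "OverlapDetector":
        with self._guard:
            self.inside += 1
            inside = self.inside
        # In native code this is the place to abort() and get a core dump.
        assert inside == 1, f"{inside} writers inside the region at once"
        return self

    def __exit__(self, *exc) -> bool:
        with self._guard:
            self.inside -= 1
        return False


detector = OverlapDetector()
buffer: list[int] = []


def update(value: int) -> None:
    # The existing (possibly buggy) synchronization stays untouched around
    # this call; the detector only observes.
    with detector:
        buffer.append(value)
```

The guard is held only for the increment itself, so the detector does not serialize the region it is watching and should not hide the bug.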
+1

Targeted unit test code is time-consuming, but effective in my experience.

Narrow down the failing code as much as you can. Write test code specific to the apparent culprit, and run it in the debugger for as long as it takes to reproduce the problem.
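A rough sketch of such a targeted harness, in Python. The `stress` function and the `LockedCounter` example are hypothetical; the idea is to hammer one small suspect operation from several threads, round after round, and stop at the first round whose result is wrong.

```python
import threading


def stress(make_state, operation, check, threads: int = 8, rounds: int = 50):
    """Return the number of the first failing round, or None if all pass."""
    for round_no in range(rounds):
        state = make_state()
        barrier = threading.Barrier(threads)

        def worker() -> None:
            barrier.wait()          # start all threads together for contention
            operation(state)

        pool = [threading.Thread(target=worker) for _ in range(threads)]
        for t in pool:
            t.start()
        for t in pool:
            t.join()
        if not check(state):
            return round_no         # stop here and inspect `state`
    return None


class LockedCounter:
    """Example suspect: a counter whose increments are protected by a lock."""

    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.value = 0


def bump(counter: LockedCounter) -> None:
    for _ in range(100):
        with counter.lock:
            counter.value += 1


result = stress(LockedCounter, bump, lambda c: c.value == 8 * 100)
```

With correct locking, as here, `result` is None. Pointing the same harness at the real culprit and raising `rounds` gives the long-running reproduction loop the answer describes.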

0

One of the strategies I use is to simulate thread interleaving by introducing spin waits. The caveat is that you should not use the standard wait mechanisms of your platform, because they are likely to introduce memory barriers. If the problem you are chasing is caused by a missing memory barrier (unlikely with lock-based strategies, where barriers are hard to omit), the standard wait mechanisms will simply mask it. Instead, put an empty loop at the points where you want your code to pause for a moment. This can increase the likelihood of reproducing a concurrency error, but it is not a magic bullet.
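A small sketch of the empty-loop trick, in Python only to show the shape. The memory-barrier caveat concerns compiled code and is not something this example can demonstrate; the function names and the `pause` parameter are made up.

```python
import threading


def spin(iterations: int) -> None:
    """Busy-wait without calling any OS or library wait primitive."""
    for _ in range(iterations):
        pass


def racy_increments(n: int, pause: int, threads: int = 2) -> int:
    """Unsynchronized read-modify-write with a spin between read and write."""
    counter = {"value": 0}

    def worker() -> None:
        for _ in range(n):
            tmp = counter["value"]       # read
            spin(pause)                  # widen the window where a switch hurts
            counter["value"] = tmp + 1   # write back a possibly stale value

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter["value"]
```

With two threads the result can fall short of `2 * n` when updates are lost, and a larger `pause` tends to make that more frequent, but no particular outcome is guaranteed on any given run.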

0

If the error is a deadlock, simply attach a debugging tool (e.g. gdb or strace) to the program after the deadlock occurs and look at where each thread is stuck. That often gives you enough information to track down the source of the error quickly.
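For a Python program, a comparable view is available from inside the process. The sketch below is an analogue, not the gdb / strace workflow itself; `dump_all_stacks` and `demo` are hypothetical helpers built on `sys._current_frames()` and the `traceback` module.

```python
import sys
import threading
import traceback


def dump_all_stacks() -> str:
    """Return a text report of where every live thread is currently executing."""
    names = {t.ident: t.name for t in threading.enumerate()}
    report = []
    for ident, frame in sys._current_frames().items():
        report.append(f"--- thread {names.get(ident, '?')} (id {ident}) ---\n")
        report.extend(traceback.format_stack(frame))
    return "".join(report)


def demo() -> str:
    lock = threading.Lock()
    lock.acquire()                 # the caller holds the lock ...
    entered = threading.Event()

    def stuck_worker() -> None:
        entered.set()
        with lock:                 # ... so this thread blocks here
            pass

    t = threading.Thread(target=stuck_worker, name="stuck", daemon=True)
    t.start()
    entered.wait()
    report = dump_all_stacks()     # shows the worker inside stuck_worker
    lock.release()                 # let the worker finish
    t.join()
    return report
```

The standard library's `faulthandler` module can produce a similar dump of all threads, for example after a timeout, which is handy when the hung process cannot run any more of your own code.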

0

Here is a small chart I made of debugging methods to consider when debugging multi-threaded code. The chart is still growing, so please leave comments and suggestions for additions. http://adec.altervista.org/blog/multithreading-debugging-chart/

0

Source: https://habr.com/ru/post/886881/

