How to debug a traceless disaster

While developing our application, we have run into a really nasty bug in one particular situation. The symptom is simple: the process just disappears. The logs end abruptly, there are no crash dumps, and no zombie processes are left behind. Dr. Watson does not notice anything either, so we are left without a trace.

The bug is hard to reproduce: it takes 3-4 hours on average of repeating the same steps, so there is probably a race condition somewhere. We have handlers in place for both SEH and regular C++ exceptions, so neither kind should go unnoticed.

Debugging has to be done on a dedicated machine, because the software drives very specialized equipment, so only remote debugging is available. With the remote debugger attached, C++ Builder does not notice that the application has gone away, and it crashes and burns as soon as we try to do any debugging in the nonexistent process.

We use a wide variety of technologies with this software:

  • OpenGL
  • DirectShow + some COTS filters
  • COM / DCOM
  • Serial COM ports talking to specialized equipment
  • C++ Builder (which uses different stack frames than VC++)

So, as you can see, I have very little to work with. What I am doing now is trying to narrow it down by instrumenting different places in the code, to find out whether the failure always happens at one particular point. I am also stripping as many steps as possible out of the operation I perform, to make the failing case as simple as possible. But the operation is very complex, so this takes a lot of time, and time (as usual) is a scarce resource.

Does anyone have good advice for me, either about the likely cause (in general, what makes a process just stop without any notice) or about methods for debugging such an elusive crash?

+4
4 answers

When native Windows code overflows the stack (usually because of infinite recursion), the process sometimes disappears exactly as you describe. The standard error dialogs and exception handling need some stack space themselves, and when there is none left, they cannot run. (Later versions of Windows handle this better and should always raise an exception; Windows XP does not count as "later" under this definition.)

The easiest way to debug this is to write a log entry on entry to (and possibly exit from) each function. These messages should go directly to a file, and if you use buffered output (cout or similar), you should flush it immediately every time. When you manage to trigger the crash, you will end up with a call trace that at least localizes the problem.
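A minimal sketch of such entry/exit logging, assuming plain C stdio is acceptable for the trace file. The names trace_line, ScopeTrace and process_frame are hypothetical and only illustrate the idea. Each line is flushed and the file is closed again before the function continues, so the text has been handed to the operating system even if the process vanishes on the next instruction.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: appends one line to the trace file and pushes it
// out of the C runtime buffer before returning.
inline void trace_line(const char* path, const std::string& text) {
    std::FILE* f = std::fopen(path, "a");
    if (!f) return;               // tracing must never break the program
    std::fputs(text.c_str(), f);
    std::fputc('\n', f);
    std::fflush(f);               // hand the data to the OS immediately
    std::fclose(f);
}

// Guard object: logs "enter" when constructed and "leave" when destroyed.
class ScopeTrace {
public:
    ScopeTrace(const char* path, const char* func) : path_(path), func_(func) {
        trace_line(path_, std::string("enter ") + func_);
    }
    ~ScopeTrace() { trace_line(path_, std::string("leave ") + func_); }
private:
    ScopeTrace(const ScopeTrace&);             // not copyable
    ScopeTrace& operator=(const ScopeTrace&);  // not assignable
    const char* path_;
    const char* func_;
};

// Example of an instrumented function (hypothetical).
inline int process_frame(const char* trace_path, int value) {
    ScopeTrace trace(trace_path, "process_frame");
    return value * 2;
}
```

If the trace ends with an "enter" line that has no matching "leave", that function (or something it called) is where the process died. Opening and closing the file for every line is slow, so it is probably best limited to the suspicious code paths.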


Infinite recursion is not the only cause (although it is the most common one). If very large variables are allocated on the stack (usually arrays with thousands or millions of elements), the same problem can occur. In particular, the alloca() function can disguise this kind of cause.
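To illustrate the large-local-variable case, here is a small sketch (the function name average_of_ramp is made up for the example). The risky form is shown only as a comment; the heap-backed form keeps nothing but the small vector object on the stack.

```cpp
#include <cstddef>
#include <vector>

// Risky pattern, shown only as a comment: a large local array lives on
// the stack and can exhaust it, especially on threads with small stacks.
//
//     double samples[1000000];   // about 8 MB of stack
//
// Heap-backed alternative: the element storage is allocated on the heap.
inline double average_of_ramp(std::size_t count) {
    if (count == 0) return 0.0;
    std::vector<double> samples(count);
    for (std::size_t i = 0; i < count; ++i) {
        samples[i] = static_cast<double>(i);   // fill with 0, 1, 2, ...
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        sum += samples[i];
    }
    return sum / static_cast<double>(count);
}
```

A search for large fixed-size local arrays and for alloca() calls whose size depends on runtime data is a cheap first pass before any debugging session.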

If you run under a debugger and break or log on guard-page exceptions, you will be notified whenever the stack grows. Let the exception be handled normally, since it is the mechanism by which the stack gets more memory, and any single occurrence may have nothing to do with the problem.


The last cause of a vanishing process that is not a stack overflow is a stray call to exit() or ExitProcess(). A full-text search should be able to rule this out in most cases; a breakpoint on the ExitProcess function in a debugger will do so completely.

+7

Why not try WinDbg? It can also connect remotely, over a named pipe or a serial port.


+2

Try running it with less heap. If the problem is that you are running out of memory, this will make the crash happen sooner.
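One way to approximate "less heap" without touching linker or operating system settings is to grab a block of ballast memory at startup, so that the rest of the program hits any out-of-memory condition earlier. This is a sketch under that assumption, not necessarily what the answer had in mind; the class name HeapBallast is hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <new>
#include <vector>

// Holds ballast blocks for the lifetime of the object and frees them
// when it is destroyed.
class HeapBallast {
public:
    HeapBallast() {}
    ~HeapBallast() {
        for (std::size_t i = 0; i < blocks_.size(); ++i) delete[] blocks_[i];
    }

    // Tries to allocate `count` blocks of `block_size` bytes each and
    // returns how many allocations succeeded.
    std::size_t reserve(std::size_t count, std::size_t block_size) {
        blocks_.reserve(blocks_.size() + count);   // so push_back cannot throw
        std::size_t got = 0;
        for (std::size_t i = 0; i < count; ++i) {
            char* p = new (std::nothrow) char[block_size];
            if (!p) break;                          // memory is already exhausted
            std::memset(p, 0xAB, block_size);       // touch the pages so they are really used
            blocks_.push_back(p);
            ++got;
        }
        return got;
    }

    std::size_t block_count() const { return blocks_.size(); }

private:
    HeapBallast(const HeapBallast&);             // not copyable
    HeapBallast& operator=(const HeapBallast&);  // not assignable
    std::vector<char*> blocks_;
};
```

Creating one HeapBallast at the start of the program and increasing the reserved amount between runs shows whether the time to failure shrinks as available memory shrinks, which would point towards memory exhaustion or a leak.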

+2

If you want to be able to debug the failing scenario more often, try running it in a virtual machine and taking snapshots shortly before the crash happens.

The catch is that a restored snapshot may be out of sync with the state of the specialized equipment you mention, which is attached over the serial port.

+1

Source: https://habr.com/ru/post/1341775/

