Detecting and recovering from Windows TDR?

I ran into an odd problem with some OpenCL code that I work on: once in a blue moon, Windows TDR kicks in and resets the GPU. The offending kernel only runs for about 150 ms and executes thousands of times (over many hours) before TDR kills it, so I'm fairly sure the kernel itself is not to blame.

My concern is that as soon as TDR kicks in, the kernel dies and the program is left hanging forever. From what I can tell, the clFinish call never returns.

Is there a way to detect that the kernel has died so that it can be handled gracefully?

2 answers

I managed to find a solution, although it is far from optimal.

I changed the program so that the OpenCL processing runs in a separate thread. I created a global watchdog variable shared between the parent and child threads. When the parent spawns the processing function as a thread, it sets the variable to the current time in milliseconds. When the processing thread finishes, it resets the watchdog variable to zero.

While the parent thread waits for the processing thread to complete, it monitors the watchdog variable. If the elapsed time exceeds a certain threshold, the program exits forcibly without waiting for the child thread to return.
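The scheme described above could look roughly like the following. This is only a minimal sketch, not the original code: the names `g_watchdog_ms`, `now_ms` and `run_with_watchdog` are made up for illustration, and the `work` callable stands in for whatever enqueues the kernel and calls clFinish.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Shared watchdog variable: 0 means "no work in flight",
// any other value is the start time of the current job in milliseconds.
static std::atomic<std::int64_t> g_watchdog_ms{0};

// Monotonic time in milliseconds (assumed to be nonzero in practice).
static std::int64_t now_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(
        steady_clock::now().time_since_epoch()).count();
}

// Runs `work` on a child thread and watches the shared variable.
// Returns true if the work finished, false if `timeout_ms` was exceeded.
bool run_with_watchdog(std::function<void()> work, std::int64_t timeout_ms) {
    g_watchdog_ms.store(now_ms());      // parent arms the watchdog
    std::thread worker([work] {
        work();                         // hypothetical: enqueue kernel + clFinish
        g_watchdog_ms.store(0);         // child signals completion
    });
    worker.detach();                    // never join: the child may hang forever

    for (;;) {
        const std::int64_t started = g_watchdog_ms.load();
        if (started == 0) return true;                      // child finished
        if (now_ms() - started > timeout_ms) return false;  // watchdog expired
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
```

On a `false` result the caller would exit the process with a failure code rather than try to continue, because the detached thread may still be blocked inside the driver.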

This solution works with or without Windows TDR. If TDR is enabled and the driver is reset, the clFinish() call never returns, and the parent exits as soon as the watchdog timer expires. If TDR is disabled, the runaway kernel freezes the display, but as soon as the watchdog timer expires the parent exits, which ends the freeze.

Now that I have the watchdog timer in place, I simply wrapped my program in a script: if it fails (positive return code), the script starts the program again.
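The wrapper itself was a script, which is not shown here. As a rough equivalent, the same restart loop can be sketched in C++. The function name `run_until_success` and the attempt limit are assumptions added for illustration, and `launch` is any callable that starts the program and returns its exit code.

```cpp
#include <functional>

// Calls `launch` until it returns 0 or `max_attempts` is used up.
// Returns the number of attempts made on success, or -1 if all attempts failed.
int run_until_success(const std::function<int()>& launch, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (launch() == 0) {
            return attempt;   // the program exited cleanly
        }
        // Non-zero exit code: assume the watchdog fired and try again.
    }
    return -1;
}
```

A real supervisor might pass something like `[] { return std::system("my_opencl_app.exe"); }` as `launch`, where the executable name is hypothetical and the value returned by `std::system` is implementation-defined.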


Ideally, you would get an error code from clFinish, or from clWaitForEvents on the OpenCL event object generated when the kernel is enqueued. However, since TDR resets the graphics driver, I do not think the OpenCL implementation will keep working reliably afterwards; that is, there is no recovery path.

Instead, turn TDR off completely. This is also useful when you are debugging code that gets stuck in an infinite loop and holds the GPU indefinitely.

If you want to keep TDR enabled and you can change the code, you can also avoid the problem, at the cost of some processing speed, by calling a thread sleep function to pause your code for a few milliseconds. This gives the graphics card a chance to respond to display commands, so TDR does not trigger.
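One way to read this suggestion is sketched below, under the assumption that the work can be split into several short kernel launches. The helper `run_in_chunks` and the `run_chunk` callable are hypothetical; `run_chunk(i)` is expected to enqueue piece `i` and wait for it to finish.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Runs `chunk_count` short pieces of GPU work with a pause between them,
// so the GPU has idle time to service display commands.
void run_in_chunks(int chunk_count,
                   std::chrono::milliseconds pause,
                   const std::function<void(int)>& run_chunk) {
    for (int i = 0; i < chunk_count; ++i) {
        run_chunk(i);                            // hypothetical: enqueue + wait
        if (i + 1 < chunk_count) {
            std::this_thread::sleep_for(pause);  // yield the GPU for a moment
        }
    }
}
```

The pause length is a trade-off: a few milliseconds per launch is usually small next to a 150 ms kernel, but it adds up over thousands of launches.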


Source: https://habr.com/ru/post/1012170/

