OpenCL: distinguishing a calculation error caused by a TDR interruption

When performing long OpenCL calculations on Windows on a graphics card that also drives the main display, the OS can interrupt the calculation via Timeout Detection and Recovery (TDR).

In my experience (Java, using JavaCL from NativeLibs4Java, with an NVIDIA GPU), this manifests itself as an "out of resources" (CL_OUT_OF_RESOURCES) error when calling clEnqueueReadBuffer.

The problem is that I get the same error from the OpenCL program for other reasons as well (for example, because of an invalid memory access).

Is there a (semi-)reliable way to distinguish an "out of resources" caused by TDR from an "out of resources" caused by other problems?

Alternatively, is it at least possible to reliably determine (in Java / via the OpenCL API) whether the GPU used for the calculation also drives the display?

I know about this question; however, the answer there deals with scenarios where clFinish does not return, which is not my problem (my code has never yet frozen inside an OpenCL API call).

1 answer

Is there a (semi-)reliable way to distinguish an "out of resources" caused by TDR from an "out of resources" caused by other problems?

1)

If you can access

 KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
 KeyValue  : TdrDelay
 ValueType : REG_DWORD
 ValueData : Number of seconds to delay. 2 seconds is the default value.

via WMI and multiply it by

 KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
 KeyValue  : TdrLimitCount
 ValueType : REG_DWORD
 ValueData : Number of TDRs before crashing. The default value is 5.

also via WMI. With the default values, multiplying them gives 10 seconds. Then you should get

 KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
 KeyValue  : TdrLimitTime
 ValueType : REG_DWORD
 ValueData : Number of seconds before crashing. 60 seconds is the default value.

which should read 60 seconds (the default) from WMI.

For this example, the machine needs 5 × 2 seconds of TDRs plus one additional delay, within the 60-second window, to break the limit. Your application can then check whether its last stopwatch measurement exceeded these limits; if so, it was probably a TDR. In addition, there is an upper limit on how long a thread may take to leave the driver:

 KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
 KeyValue  : TdrDdiDelay
 ValueType : REG_DWORD
 ValueData : Number of seconds to leave the driver. 5 seconds is the default value.

which is 5 seconds by default. An invalid memory access should fail faster than that. You could also increase these TDR time limits to a few minutes so that the program can compute without glitches due to starvation. But modifying the registry can be dangerous: if, for example, you set a TDR time limit of 1 second or some fraction of it, Windows may never boot without constant TDR crashes, so just reading these values is the safer option.
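For illustration, here is a minimal Java sketch that reads the same values directly from the registry instead of going through WMI, using JNA's Advapi32Util (this assumes the jna-platform library is on the classpath; values missing from the registry fall back to the documented defaults):

 import com.sun.jna.platform.win32.Advapi32Util;
 import com.sun.jna.platform.win32.WinReg;

 /** Reads the TDR limits from the registry, falling back to the
  *  documented defaults when a value is not present. */
 public final class TdrLimits {

     private static final String KEY =
             "SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers";

     private static int readOrDefault(String name, int defaultValue) {
         return Advapi32Util.registryValueExists(WinReg.HKEY_LOCAL_MACHINE, KEY, name)
                 ? Advapi32Util.registryGetIntValue(WinReg.HKEY_LOCAL_MACHINE, KEY, name)
                 : defaultValue;
     }

     public static void main(String[] args) {
         int tdrDelay      = readOrDefault("TdrDelay", 2);       // seconds per TDR
         int tdrLimitCount = readOrDefault("TdrLimitCount", 5);  // TDRs before crash
         int tdrLimitTime  = readOrDefault("TdrLimitTime", 60);  // window, in seconds

         System.out.printf("A kernel stalling the GPU for more than ~%d s risks a TDR;%n",
                 tdrDelay);
         System.out.printf("%d TDRs within %d s escalate to a system crash.%n",
                 tdrLimitCount, tdrLimitTime);
     }
 }

Compare your stopwatch measurement of the failed call against these values to decide whether a TDR is plausible.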

2)

Divide the overall work into much smaller parts. If the data is not separable, copy it once, then run the long-running kernel as n short-running kernels, with some waiting between any two launches.

Then you can be sure that TDR is ruled out. If this version works but the long-running kernel fails, it is a TDR error. If it is the other way around, it is a memory fault. It looks like this:

 short running x 1024 times
 long running
 long running   <---- fails? TDR! because a memory fault would crash the short version too!
 long running

One more attempt:

 short running x 1024 times   <---- fails? memory! because it is only 1 ms per kernel
 long running
 long running
 long running
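A minimal sketch of the short-running variant, using JOCL for illustration (the asker uses JavaCL, but the idea is the same). The queue and kernel are assumed to have been created elsewhere, and chunkSize should be tuned so that each launch takes roughly a millisecond:

 import org.jocl.*;
 import static org.jocl.CL.*;

 /** Launch one logical job as many short-running chunks so that each
  *  enqueue stays well under the TDR delay. */
 public final class ChunkedLaunch {
     public static void runInChunks(cl_command_queue queue, cl_kernel kernel,
                                    long totalWorkItems, long chunkSize)
             throws InterruptedException {
         for (long offset = 0; offset < totalWorkItems; offset += chunkSize) {
             long size = Math.min(chunkSize, totalWorkItems - offset);
             clEnqueueNDRangeKernel(queue, kernel, 1,
                     new long[]{offset},   // global work offset for this chunk
                     new long[]{size},     // chunk size, tuned to run ~1 ms
                     null, 0, null, null);
             clFinish(queue);              // let the display driver breathe
             Thread.sleep(1);              // "some waiting between any two"
         }
     }
 }

The kernel must index its data from get_global_id (which includes the offset) for the chunks to cover the same work as the single long-running launch.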

Alternatively, is it at least possible to reliably determine (in Java / via the OpenCL API) whether the GPU used for the calculation also drives the display?

1)

Use the OpenCL/OpenGL interop device query:

 // taken from the Intel site:
 std::vector<cl_device_id> devs(devNum);
 // reading the info
 clGetGLContextInfoKHR(props, CL_DEVICES_FOR_GL_CONTEXT_KHR,
                       bytes, devs.data(), NULL);

This gives the list of devices compatible with the current GL context. You can take that device id and exclude it if you do not want to compute on the display GPU.

2)

Have another thread run some OpenGL or DirectX code that draws a static object, so that the display GPU is kept busy. Then benchmark all GPUs at the same time, each from its own thread, with some simple OpenCL kernels. The test:

  • OpenGL starts drawing something with a high triangle count at 60 fps.
  • Start the OpenCL benchmark on all devices and record the average kernel executions per second ("keps"):
  • device 1: 30 keps
  • device 2: 40 keps
  • After a while, stop OpenGL and close its window (if not already closed).
  • device 1: 75 keps → the biggest percentage increase → the display GPU!
  • device 2: 41 keps → no such large increase, but it is possible

You should not copy any data between devices while doing this, so the CPU/RAM will not become a bottleneck.
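A hypothetical sketch of such a measurement, again in JOCL: it counts completed launches of a trivial kernel over about one second on a given queue. Run it per device, with and without the OpenGL load, and compare the increase:

 import org.jocl.*;
 import static org.jocl.CL.*;

 public final class KepsProbe {
     /** Returns approximate kernel executions per second ("keps") for one
      *  queue/kernel pair; the kernel should be trivial so the number
      *  mostly reflects how busy the device is with other work. */
     public static double measureKeps(cl_command_queue queue, cl_kernel kernel,
                                      long globalSize) {
         long deadline = System.nanoTime() + 1_000_000_000L; // run for ~1 s
         int launches = 0;
         while (System.nanoTime() < deadline) {
             clEnqueueNDRangeKernel(queue, kernel, 1, null,
                     new long[]{globalSize}, null, 0, null, null);
             clFinish(queue);   // block so we count completed executions
             launches++;
         }
         return launches;       // ≈ launches per second over the ~1 s window
     }
 }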

3)

If the data is separable, you can use a divide-and-conquer scheduler that gives each GPU its share of work only when it is available, leaving the display GPU some slack (this is a performance-oriented solution, similar to the short-running version above, but with the scheduling done across multiple GPUs).

4)

I could not verify this because I sold my second GPU, but you should try

 CL_DEVICE_TYPE_DEFAULT 

on your multi-GPU system to check whether it returns the display GPU or not. Power off the PC, plug the monitor cable into the other board, try again. Power off, swap the cards, try again. Power off, remove one of the cards so that only one GPU and the CPU remain, try again. If all of these give only the display GPU, then CL_DEVICE_TYPE_DEFAULT should reliably indicate the display GPU.
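A minimal JOCL sketch of that check, assuming a single platform (error handling via JOCL's exception mode):

 import org.jocl.*;
 import static org.jocl.CL.*;

 /** Prints the name of the device the platform reports for
  *  CL_DEVICE_TYPE_DEFAULT, to compare against your display GPU. */
 public final class DefaultDeviceProbe {
     public static void main(String[] args) {
         setExceptionsEnabled(true);
         cl_platform_id[] platforms = new cl_platform_id[1];
         clGetPlatformIDs(1, platforms, null);

         cl_device_id[] devices = new cl_device_id[1];
         clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_DEFAULT, 1, devices, null);

         // Read the device name to see which board is the default one
         long[] size = new long[1];
         clGetDeviceInfo(devices[0], CL_DEVICE_NAME, 0, null, size);
         byte[] buffer = new byte[(int) size[0]];
         clGetDeviceInfo(devices[0], CL_DEVICE_NAME, buffer.length,
                 Pointer.to(buffer), null);
         System.out.println("Default device: "
                 + new String(buffer, 0, buffer.length - 1));
     }
 }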


Source: https://habr.com/ru/post/1012168/

