There is a fair amount of overhead associated with sending GPU hardware commands through the WDDM stack.
As you've discovered, this means that in WDDM mode (only), GPU commands can get "batched" to amortize this overhead. The batching process may (probably will) introduce some latency, which can be variable, depending on what else is going on.
The best solution on Windows is to switch the GPU's operating mode from WDDM to TCC, which can be done via the nvidia-smi tool, but it is only supported on Tesla GPUs and certain members of the Quadro GPU family - i.e. not GeForce. (It also has the side effect of preventing the device from being used as a Windows display device, which may be relevant for a Quadro device or certain older Fermi Tesla GPUs.)
AFAIK there is no officially documented method to circumvent or affect the WDDM batching process in the driver, but unofficially I've heard, according to Greg@NV in this link, that the command to issue after a CUDA kernel call is cudaEventQuery(0); which may/should cause the WDDM batch queue to flush to the GPU.
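A minimal sketch of how that might look in practice (the kernel and launch configuration here are placeholders, not from the original answer):

```cpp
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder for real work */ }

int main() {
    myKernel<<<1, 1>>>();  // launch is queued in the WDDM batch, not yet submitted

    // Querying the null event returns an error (there is no event 0), which is
    // deliberately ignored; the useful side effect is that it should prompt the
    // driver to flush the batched commands to the GPU without blocking the host.
    (void)cudaEventQuery(0);

    // ... do independent CPU work here while the kernel runs ...

    cudaDeviceSynchronize();  // block until the kernel actually finishes
    return 0;
}
```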
As Greg points out, extensive use of this mechanism will wipe out the amortization benefit and may do more harm than good.
EDIT: moving forward to 2016, a newer recommendation for a "low-impact" flush of the WDDM command queue would be cudaStreamQuery(stream);
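The stream-based variant might look like this (again a sketch; the stream and kernel are illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder for real work */ }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    myKernel<<<1, 1, 0, stream>>>();  // queued in the WDDM batch for this stream

    // Non-blocking: returns cudaSuccess or cudaErrorNotReady immediately, and
    // as a side effect should flush the WDDM command queue for this stream.
    (void)cudaStreamQuery(stream);

    cudaStreamSynchronize(stream);    // later, wait for completion
    cudaStreamDestroy(stream);
    return 0;
}
```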
EDIT2: Using recent drivers on Windows, you should be able to place Titan-family GPUs in TCC mode, assuming you have some other GPU set up for the primary display. The nvidia-smi tool will allow you to switch modes (use nvidia-smi --help for more info).
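From an elevated command prompt, the switch might look like this (a sketch; check nvidia-smi --help on your driver version for the exact flags, and note that a reboot is required for the change to take effect):

```
:: set GPU index 0 to the TCC driver model (1 = TCC, 0 = WDDM); requires admin
nvidia-smi -i 0 -dm 1

:: verify: look for the "Driver Model" field in the query output
nvidia-smi -i 0 -q
```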
Additional information on the TCC driver model can be found in the Windows installation guide, including that it may reduce kernel launch latency.