There is a fair amount of overhead associated with sending GPU hardware commands through the WDDM stack.
As you've discovered, this means that in WDDM mode (only), GPU commands can get "batched" to amortize this overhead. The batching process may (probably will) introduce some latency, which can be variable, depending on what else is going on.
The best solution on Windows is to switch the GPU's operating mode from WDDM to TCC, which can be done via the nvidia-smi tool, but it is only supported on Tesla GPUs and certain members of the Quadro GPU family - i.e. not GeForce. (It also has the side effect of preventing the device from being used as a Windows display device, which may be relevant for a Quadro device or certain older Fermi Tesla GPUs.)
AFAIK there is no officially documented method to circumvent or affect the WDDM batching process in the driver, but unofficially I've heard, according to Greg@NV in this link, that the command to issue after a CUDA kernel call is cudaEventQuery(0); which may/should cause the WDDM batch queue to flush to the GPU.
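A minimal sketch of how that might look in practice (the kernel and launch configuration here are placeholders, not from the original answer):

```cpp
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder for real work */ }

int main() {
    myKernel<<<1, 1>>>();  // launch is queued in the WDDM batch, not yet submitted

    // Querying the null event returns an error (there is no event 0), which is
    // deliberately ignored; the useful side effect is that it should prompt the
    // driver to flush the batched commands to the GPU without blocking the host.
    (void)cudaEventQuery(0);

    // ... do independent CPU work here while the kernel runs ...

    cudaDeviceSynchronize();  // block until the kernel actually finishes
    return 0;
}
```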
As Greg points out, extensive use of this mechanism will wipe out the amortization benefit and may do more harm than good.
EDIT: moving forward to 2016, a newer recommendation for a "low-impact" flush of the WDDM command queue would be cudaStreamQuery(stream);
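The stream-based variant might look like this (again a sketch; the stream and kernel are illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder for real work */ }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    myKernel<<<1, 1, 0, stream>>>();  // queued in the WDDM batch for this stream

    // Non-blocking: returns cudaSuccess or cudaErrorNotReady immediately, and
    // as a side effect should flush the WDDM command queue for this stream.
    (void)cudaStreamQuery(stream);

    cudaStreamSynchronize(stream);    // later, wait for completion
    cudaStreamDestroy(stream);
    return 0;
}
```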
EDIT2: Using recent drivers on Windows, you should be able to place Titan-family GPUs in TCC mode, assuming you have some other GPU set up for the primary display. The nvidia-smi tool will allow you to switch modes (use nvidia-smi --help for more info).
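From an elevated command prompt, the switch might look like this (a sketch; check nvidia-smi --help on your driver version for the exact flags, and note that a reboot is required for the change to take effect):

```
:: set GPU index 0 to the TCC driver model (1 = TCC, 0 = WDDM); requires admin
nvidia-smi -i 0 -dm 1

:: verify: look for the "Driver Model" field in the query output
nvidia-smi -i 0 -q
```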
Additional information on the TCC driver model can be found in the Windows installation guide, including that it may reduce kernel launch latency.