When I try to overlap data transfer and kernel execution, it seems that the card performs all memory transfers in order, no matter which stream they were issued to.
So, if I issue the following:
- stream 1: MemcpyA_HtoD_1; Kernel_1; MemcpyA_DtoH_1
- stream 2: MemcpyA_HtoD_2; Kernel_2; MemcpyA_DtoH_2
MemcpyA_HtoD_2 will wait for MemcpyA_DtoH_1 to complete, so no overlap is achieved. No matter what stream configuration I use, the memcpy operations always execute in the order in which they were issued. So the only way to achieve overlap seems to be to buffer the outputs, or to delay each device-to-host transfer until the next iteration (a sketch of that idea follows).
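For clarity, here is a minimal sketch of that "delay the device-to-host copy by one iteration" idea, so that each D2H is issued after the next iteration's H2D. It reuses the buffer and stream names from the repro program in the update below, and I have not verified it in exactly this form:

// Hedged sketch: defer each device-to-host copy by one iteration, so the
// D2H of iteration i-1 is issued after the H2D of iteration i.
for (int i = 0; i < 100; i++) {
    cudaMemcpyAsync(d_bufferIn[i], h_bufferIn[i], N*sizeof(int),
                    cudaMemcpyHostToDevice, st[i%2]);
    someKernel<<<1, 256, 0, st[i%2]>>>(d_bufferIn[i], d_bufferOut[i]);
    if (i > 0) {
        // now issue the previous iteration's D2H, in its own stream
        cudaMemcpyAsync(h_bufferOut[i-1], d_bufferOut[i-1], N*sizeof(int),
                        cudaMemcpyDeviceToHost, st[(i-1)%2]);
    }
}
// drain the last pending copy
cudaMemcpyAsync(h_bufferOut[99], d_bufferOut[99], N*sizeof(int),
                cudaMemcpyDeviceToHost, st[99%2]);
cudaDeviceSynchronize();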
I am using CUDA 5.5, Windows 7 x64 and a GTX Titan. All host memory is pinned, and the data transfers use the asynchronous API (cudaMemcpyAsync).
See the following screenshots of the behavior:
Issuing host_to_device -> kernel -> device_to_host (the normal pattern) gets no overlap.

Issuing host_to_device -> kernel (omitting the device_to_host after the kernel) does achieve overlap ... again because all memory copies execute in order, regardless of the stream configuration I try. A breadth-first issue order might sidestep this; see the sketch below.
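For what it's worth, a breadth-first issue order would avoid queuing a device-to-host copy between a host-to-device copy and the kernel it is supposed to overlap with. A sketch, again using the names from the repro program below and untested on my setup:

// Hedged sketch: breadth-first issue order across the pair of streams,
// so both H2D copies precede both kernels, which precede both D2H copies.
for (int i = 0; i < 100; i += 2) {
    cudaMemcpyAsync(d_bufferIn[i],   h_bufferIn[i],   N*sizeof(int),
                    cudaMemcpyHostToDevice, st[0]);
    cudaMemcpyAsync(d_bufferIn[i+1], h_bufferIn[i+1], N*sizeof(int),
                    cudaMemcpyHostToDevice, st[1]);
    someKernel<<<1, 256, 0, st[0]>>>(d_bufferIn[i],   d_bufferOut[i]);
    someKernel<<<1, 256, 0, st[1]>>>(d_bufferIn[i+1], d_bufferOut[i+1]);
    cudaMemcpyAsync(h_bufferOut[i],   d_bufferOut[i],   N*sizeof(int),
                    cudaMemcpyDeviceToHost, st[0]);
    cudaMemcpyAsync(h_bufferOut[i+1], d_bufferOut[i+1], N*sizeof(int),
                    cudaMemcpyDeviceToHost, st[1]);
}
cudaDeviceSynchronize();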

UPDATE
If anyone is interested in reproducing this problem, I have written a synthetic program that shows this unwanted behavior. It's a complete VS2010 solution using CUDA 5.5.
(The VS2010 solution download link that used to be here no longer works.)
Can someone run this on Linux to check for overlap?
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

#define N 1024*1024

__global__ void someKernel(int *d_in, int *d_out) {
    for (int i = threadIdx.x; i < threadIdx.x + 1024; i++) {
        d_out[i] = d_in[i];
    }
}

int main() {
    int *h_bufferIn[100];
    int *h_bufferOut[100];
    int *d_bufferIn[100];
    int *d_bufferOut[100];

    // allocate pinned host memory and device memory
    for (int i = 0; i < 100; i++) {
        cudaMallocHost(&h_bufferIn[i],  N*sizeof(int));
        cudaMallocHost(&h_bufferOut[i], N*sizeof(int));
        cudaMalloc(&d_bufferIn[i],  N*sizeof(int));
        cudaMalloc(&d_bufferOut[i], N*sizeof(int));
    }

    // create two CUDA streams
    cudaStream_t st[2];
    cudaStreamCreate(&st[0]);
    cudaStreamCreate(&st[1]);

    // trying to overlap computation and memcpys across the two streams
    for (int i = 0; i < 100; i += 2) {
        cudaMemcpyAsync(d_bufferIn[i], h_bufferIn[i], N*sizeof(int),
                        cudaMemcpyHostToDevice, st[i%2]);
        someKernel<<<1, 256, 0, st[i%2]>>>(d_bufferIn[i], d_bufferOut[i]);
        cudaMemcpyAsync(h_bufferOut[i], d_bufferOut[i], N*sizeof(int),
                        cudaMemcpyDeviceToHost, st[i%2]);
        cudaStreamQuery(0); // query the null stream, to try to flush the driver's command queue

        cudaMemcpyAsync(d_bufferIn[i+1], h_bufferIn[i+1], N*sizeof(int),
                        cudaMemcpyHostToDevice, st[(i+1)%2]);
        someKernel<<<1, 256, 0, st[(i+1)%2]>>>(d_bufferIn[i+1], d_bufferOut[i+1]);
        cudaMemcpyAsync(h_bufferOut[i+1], d_bufferOut[i+1], N*sizeof(int),
                        cudaMemcpyDeviceToHost, st[(i+1)%2]);
        cudaStreamQuery(0); // query the null stream, to try to flush the driver's command queue
    }

    cudaDeviceSynchronize();
    return 0;
}
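In case it helps anyone checking this on Linux, something like the following should build and trace it (the file name is my choice, and -arch=sm_35 is an assumption matching the GTX Titan):

nvcc -arch=sm_35 -o overlap overlap.cu
nvprof --print-gpu-trace ./overlap

The GPU trace should show whether the copies in the two streams actually overlap the kernels, without needing the Visual Profiler timeline.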