When I try to overlap data transfer and kernel execution, it seems that the card performs all memory transfers in order, no matter which stream they were issued to.
So, if I issue the following:
- stream 1: MemcpyA_HtoD_1; Kernel_1; MemcpyA_DtoH_1
- stream 2: MemcpyA_HtoD_2; Kernel_2; MemcpyA_DtoH_2
MemcpyA_HtoD_2 will wait for MemcpyA_DtoH_1 to complete, so no overlap is achieved. No matter what stream configuration I use, the memcpy operations always execute in the order in which they were issued. So the only way to achieve overlap seems to be to buffer the outputs, or to delay each device-to-host transfer until the next iteration (a sketch of that idea follows).
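For clarity, here is a minimal sketch of that "delay the device-to-host copy by one iteration" idea, so that each D2H is issued after the next iteration's H2D. It reuses the buffer and stream names from the repro program in the update below, and I have not verified it in exactly this form:

// Hedged sketch: defer each device-to-host copy by one iteration, so the
// D2H of iteration i-1 is issued after the H2D of iteration i.
for (int i = 0; i < 100; i++) {
    cudaMemcpyAsync(d_bufferIn[i], h_bufferIn[i], N*sizeof(int),
                    cudaMemcpyHostToDevice, st[i%2]);
    someKernel<<<1, 256, 0, st[i%2]>>>(d_bufferIn[i], d_bufferOut[i]);
    if (i > 0) {
        // now issue the previous iteration's D2H, in its own stream
        cudaMemcpyAsync(h_bufferOut[i-1], d_bufferOut[i-1], N*sizeof(int),
                        cudaMemcpyDeviceToHost, st[(i-1)%2]);
    }
}
// drain the last pending copy
cudaMemcpyAsync(h_bufferOut[99], d_bufferOut[99], N*sizeof(int),
                cudaMemcpyDeviceToHost, st[99%2]);
cudaDeviceSynchronize();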
I am using CUDA 5.5, Windows 7 x64 and a GTX Titan. All host memory is pinned, and the data transfers use the asynchronous API (cudaMemcpyAsync).
See the following screenshots of the behavior:
Issuing host_to_device -> kernel -> device_to_host (the normal pattern) gets no overlap.

Issuing host_to_device -> kernel (omitting the device_to_host after the kernel) does achieve overlap ... again because all memory copies execute in order, regardless of the stream configuration I try. A breadth-first issue order might sidestep this; see the sketch below.
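For what it's worth, a breadth-first issue order would avoid queuing a device-to-host copy between a host-to-device copy and the kernel it is supposed to overlap with. A sketch, again using the names from the repro program below and untested on my setup:

// Hedged sketch: breadth-first issue order across the pair of streams,
// so both H2D copies precede both kernels, which precede both D2H copies.
for (int i = 0; i < 100; i += 2) {
    cudaMemcpyAsync(d_bufferIn[i],   h_bufferIn[i],   N*sizeof(int),
                    cudaMemcpyHostToDevice, st[0]);
    cudaMemcpyAsync(d_bufferIn[i+1], h_bufferIn[i+1], N*sizeof(int),
                    cudaMemcpyHostToDevice, st[1]);
    someKernel<<<1, 256, 0, st[0]>>>(d_bufferIn[i],   d_bufferOut[i]);
    someKernel<<<1, 256, 0, st[1]>>>(d_bufferIn[i+1], d_bufferOut[i+1]);
    cudaMemcpyAsync(h_bufferOut[i],   d_bufferOut[i],   N*sizeof(int),
                    cudaMemcpyDeviceToHost, st[0]);
    cudaMemcpyAsync(h_bufferOut[i+1], d_bufferOut[i+1], N*sizeof(int),
                    cudaMemcpyDeviceToHost, st[1]);
}
cudaDeviceSynchronize();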

UPDATE
If anyone is interested in reproducing this problem, I have written a synthetic program that shows this unwanted behavior. It's a complete VS2010 solution using CUDA 5.5.
(The VS2010 solution download link that used to be here no longer works.)
Can someone run this on Linux to check for overlap?
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

#define N 1024*1024

__global__ void someKernel(int *d_in, int *d_out) {
    for (int i = threadIdx.x; i < threadIdx.x + 1024; i++) {
        d_out[i] = d_in[i];
    }
}

int main() {
    int *h_bufferIn[100];
    int *h_bufferOut[100];
    int *d_bufferIn[100];
    int *d_bufferOut[100];

    // allocate pinned host memory and device memory
    for (int i = 0; i < 100; i++) {
        cudaMallocHost(&h_bufferIn[i],  N*sizeof(int));
        cudaMallocHost(&h_bufferOut[i], N*sizeof(int));
        cudaMalloc(&d_bufferIn[i],  N*sizeof(int));
        cudaMalloc(&d_bufferOut[i], N*sizeof(int));
    }

    // create two CUDA streams
    cudaStream_t st[2];
    cudaStreamCreate(&st[0]);
    cudaStreamCreate(&st[1]);

    // trying to overlap computation and memcpys across the two streams
    for (int i = 0; i < 100; i += 2) {
        cudaMemcpyAsync(d_bufferIn[i], h_bufferIn[i], N*sizeof(int),
                        cudaMemcpyHostToDevice, st[i%2]);
        someKernel<<<1, 256, 0, st[i%2]>>>(d_bufferIn[i], d_bufferOut[i]);
        cudaMemcpyAsync(h_bufferOut[i], d_bufferOut[i], N*sizeof(int),
                        cudaMemcpyDeviceToHost, st[i%2]);
        cudaStreamQuery(0); // query the null stream, to try to flush the driver's command queue

        cudaMemcpyAsync(d_bufferIn[i+1], h_bufferIn[i+1], N*sizeof(int),
                        cudaMemcpyHostToDevice, st[(i+1)%2]);
        someKernel<<<1, 256, 0, st[(i+1)%2]>>>(d_bufferIn[i+1], d_bufferOut[i+1]);
        cudaMemcpyAsync(h_bufferOut[i+1], d_bufferOut[i+1], N*sizeof(int),
                        cudaMemcpyDeviceToHost, st[(i+1)%2]);
        cudaStreamQuery(0); // query the null stream, to try to flush the driver's command queue
    }

    cudaDeviceSynchronize();
    return 0;
}
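In case it helps anyone checking this on Linux, something like the following should build and trace it (the file name is my choice, and -arch=sm_35 is an assumption matching the GTX Titan):

nvcc -arch=sm_35 -o overlap overlap.cu
nvprof --print-gpu-trace ./overlap

The GPU trace should show whether the copies in the two streams actually overlap the kernels, without needing the Visual Profiler timeline.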