I have an application in which the processing load is split between the GPUs in the user's system. Basically, there is one host thread per GPU that initiates that GPU's processing interval whenever it is periodically triggered by the main application thread.
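Roughly, the structure looks like this (a simplified sketch, not my actual code - workerForGpu and the trigger/processing details are placeholders):

    #include <cuda_runtime.h>
    #include <thread>
    #include <vector>

    // Sketch of the per-GPU worker threads: each thread binds itself to one
    // device and runs that device's processing interval when triggered.
    void workerForGpu(int device)
    {
        cudaSetDevice(device);      // bind this host thread to one GPU
        // ... wait for the periodic trigger from the main thread, then run
        // the actual per-GPU processing (omitted here).
    }

    int main()
    {
        int numGpus = 0;
        cudaGetDeviceCount(&numGpus);

        std::vector<std::thread> workers;
        for (int d = 0; d < numGpus; ++d)
            workers.emplace_back(workerForGpu, d);

        for (auto& t : workers)
            t.join();
        return 0;
    }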
Here is an example (generated with the NVIDIA CUDA profiler) of what a GPU processing interval looks like when the application uses a single GPU:

As you can see, most of the GPU processing time is consumed by the two sort operations, for which I use the Thrust library (thrust::sort_by_key). It also looks like thrust::sort_by_key makes several cudaMalloc calls under the hood before it starts the actual sort.
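For reference, the sort itself is just a plain thrust::sort_by_key on device vectors, roughly like this (the key/value types and the size here are illustrative, not the real ones); as far as I can tell, each such call allocates its temporary buffers with cudaMalloc internally:

    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>
    #include <cstdlib>

    int main()
    {
        const int n = 1 << 20;                 // illustrative size, not the real one

        // Fill some keys/values on the host, then copy them to the device.
        thrust::host_vector<unsigned int> hKeys(n);
        thrust::host_vector<float>        hVals(n);
        for (int i = 0; i < n; ++i) { hKeys[i] = rand(); hVals[i] = (float)i; }

        thrust::device_vector<unsigned int> dKeys = hKeys;
        thrust::device_vector<float>        dVals = hVals;

        // Each call to sort_by_key allocates its temporary storage internally,
        // which shows up as cudaMalloc calls in the profiler timeline.
        thrust::sort_by_key(dKeys.begin(), dKeys.end(), dVals.begin());
        return 0;
    }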
Now consider the same processing interval when the application spreads the processing load across two GPUs:

In an ideal world, you would expect the 2-GPU processing interval to be exactly half that of the single-GPU case (since each GPU does half the work). As you can see, this is not so, partly because the cudaMalloc calls seem to take longer when they are made at the same time (sometimes 2-3 times longer), apparently due to some kind of contention issue. I don't understand why this should be the case, because the memory allocation space for the two GPUs is completely independent, so there should be no need for a system-wide lock in cudaMalloc - a per-GPU lock would be far more reasonable.
To prove my hypothesis that the problem is caused by the simultaneous cudaMalloc calls, I created a ridiculously simple program with two CPU threads (one per GPU), each calling cudaMalloc several times (a condensed sketch of it is included below). First, I ran the program so that the two threads do not call cudaMalloc at the same time:

You can see that it takes ~175 microseconds per allocation. Then I ran the program with the two threads calling cudaMalloc at the same time:

Here, each call took ~538 microseconds, i.e. 3 times longer than in the previous case! Needless to say, this slows my application down significantly, and it stands to reason that the problem would only get worse with more than 2 GPUs.
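For reference, the test program is essentially the following condensed sketch (the real program also times each call; mallocLoop, the loop count and the allocation size are illustrative):

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <thread>

    // Each thread binds to its own GPU and performs a burst of cudaMalloc/cudaFree.
    void mallocLoop(int device)
    {
        cudaSetDevice(device);
        for (int i = 0; i < 10; ++i)
        {
            void* p = nullptr;
            cudaError_t err = cudaMalloc(&p, 1 << 20);   // 1 MB; size is illustrative
            if (err != cudaSuccess)
                printf("GPU %d: cudaMalloc failed: %s\n",
                       device, cudaGetErrorString(err));
            cudaFree(p);
        }
    }

    int main()
    {
        // Concurrent case: both threads call cudaMalloc at the same time.
        std::thread t0(mallocLoop, 0);
        std::thread t1(mallocLoop, 1);
        t0.join();
        t1.join();

        // For the serialized case, run one thread to completion before
        // starting the other (join t0 before constructing t1).
        return 0;
    }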
I have noticed this behavior on both Linux and Windows. On Linux I am using NVIDIA driver version 319.60, and on Windows version 327.23. I am using CUDA toolkit 5.5.
Possible cause: I am using a GTX 690 in these tests. This card is basically two GTX 680-like GPUs housed in a single unit. This is the only "multi-GPU" setup I have run, so perhaps the cudaMalloc problem has something to do with some hardware dependence between the 690's two GPUs?