Poor performance when calling cudaMalloc with two GPUs simultaneously

I have an application in which I split the processing load among the GPUs on a user's system. Basically, there is one CPU thread per GPU that initiates a GPU processing interval when triggered periodically by the main application thread.

Consider the following image (generated using NVIDIA's CUDA profiler tool) as an example of a GPU processing interval; here the application is using a single GPU.

[Profiler timeline: single-GPU processing interval]

As you can see, a large portion of the GPU processing time is consumed by the two sort operations, for which I use the Thrust library (thrust::sort_by_key). Also, it looks like thrust::sort_by_key calls a few cudaMallocs under the hood before it starts the actual sort.
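For context, here is a minimal sketch of the kind of key/value sort involved; the element types and names are illustrative, not the application's actual data. Each call like this lets Thrust allocate its temporary storage with cudaMalloc internally, which is what shows up in the profiler timeline.

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Minimal sketch (illustrative types): a key/value sort like the ones in the
// processing interval. Thrust allocates its temporary buffers with cudaMalloc
// internally on every call before performing the actual sort.
void sort_interval(thrust::device_vector<unsigned int>& keys,
                   thrust::device_vector<int>&          values)
{
    thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
}
```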

Now consider the same processing interval when the application spreads the processing load over two GPUs:

[Profiler timeline: the same processing interval split across two GPUs]

In a perfect world you would expect the 2-GPU processing interval to be half that of the single GPU (since each GPU does half the work). As you can see, this is not the case, partly because the cudaMallocs seem to take longer when they are called simultaneously (sometimes 2-3 times longer) due to some sort of contention issue. I don't understand why this needs to be the case, because the memory allocation space for the two GPUs is completely independent, so there should not be a system-wide lock on cudaMalloc; a per-GPU lock would be more reasonable.

To prove my hypothesis that the problem stems from simultaneous cudaMalloc calls, I created a ridiculously simple program with two CPU threads (one per GPU), each calling cudaMalloc several times. I first ran this program so that the two threads do not call cudaMalloc at the same time:

[Profiler timeline: the two threads calling cudaMalloc at different times]

You can see that each allocation takes ~175 microseconds. Next, I ran the program with the threads calling cudaMalloc simultaneously:

[Profiler timeline: the two threads calling cudaMalloc simultaneously]

Here, each call took ~538 microseconds, or 3 times longer than in the previous case! Needless to say, this slows down my application tremendously, and it stands to reason that the problem would only get worse with more than 2 GPUs.
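For reference, here is a minimal sketch of the kind of two-thread test program described above. This is not the original program; the allocation size, iteration count, and the use of std::thread are illustrative assumptions.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

// One CPU thread per GPU, each calling cudaMalloc several times.
void alloc_loop(int device, int iterations, size_t bytes)
{
    cudaSetDevice(device);                 // bind this thread to its GPU
    std::vector<void*> ptrs(iterations);
    for (int i = 0; i < iterations; ++i)
        cudaMalloc(&ptrs[i], bytes);       // the call being timed in the profiler
    for (int i = 0; i < iterations; ++i)
        cudaFree(ptrs[i]);
}

int main()
{
    const int    iterations = 10;                 // arbitrary
    const size_t bytes      = 64 * 1024 * 1024;   // arbitrary allocation size

    std::thread t0(alloc_loop, 0, iterations, bytes);
    std::thread t1(alloc_loop, 1, iterations, bytes);   // runs concurrently with t0
    t0.join();
    t1.join();

    printf("done\n");
    return 0;
}
```

Joining t0 before starting t1 gives the staggered case; launching both threads before joining gives the simultaneous case.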

I have noticed this behavior on both Linux and Windows. On Linux I am using Nvidia driver version 319.60, and on Windows version 327.23. I am using CUDA toolkit 5.5.

Possible cause: I am using a GTX 690 in these tests. This card is basically two 680-like GPUs housed in the same unit. This is the only "multi-GPU" setup I have run, so perhaps the cudaMalloc problem has something to do with some hardware dependency between the 690's two GPUs?

2 answers

To summarize the problem and give a possible solution:

The cudaMalloc contention probably stems from driver-level contention (possibly due to the need to switch device contexts, as the other answer suggests), and this extra latency can be avoided in performance-critical sections by cudaMalloc-ing temporary buffers beforehand.

It looks like I probably need to refactor my code so that I am not calling any sorting routine that calls cudaMalloc under the hood (in my case thrust::sort_by_key). The CUB library looks promising in this regard. As a bonus, CUB also exposes a CUDA stream parameter to the user, which could further improve performance.

See "CUB (CUDA UnBound) equivalent of thrust::gather" for some details on moving from Thrust to CUB.

UPDATE:

I backed out the thrust::sort_by_key calls in favor of cub::DeviceRadixSort::SortPairs.
Doing this shaved milliseconds off my per-interval processing time. In addition, the multi-GPU contention issue resolved itself: offloading to 2 GPUs drops the processing time by almost 50%, as expected.
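For illustration, here is a hedged sketch of the cub::DeviceRadixSort::SortPairs pattern referred to above, with the temporary storage sized and allocated once up front (by first calling SortPairs with a null temp-storage pointer) so that no cudaMalloc occurs inside the timed interval. Buffer names and key/value types are assumptions, not the original code.

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sort pairs without any allocation inside the processing interval.
// d_temp/temp_bytes were obtained earlier by calling SortPairs once with
// d_temp_storage == NULL and then allocating that many bytes one time.
void sort_pairs_no_alloc(const unsigned int* d_keys_in, unsigned int* d_keys_out,
                         const int*          d_vals_in, int*          d_vals_out,
                         int num_items, void* d_temp, size_t temp_bytes,
                         cudaStream_t stream)
{
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out,
                                    num_items,
                                    0, int(sizeof(unsigned int)) * 8,  // full key range
                                    stream);
}
```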


I will preface this with a disclaimer: I am not privy to the internals of the NVIDIA driver, so this is somewhat speculative.

The slowdown you are seeing is just driver-level contention caused by multiple threads calling device malloc simultaneously. Device memory allocation requires a number of OS system calls, as does driver-level context switching, and there is a non-trivial amount of latency in both operations. The extra time you see when two threads try to allocate memory at the same time is probably caused by the additional driver latency of switching from one device to the other throughout the sequence of system calls required to allocate memory on both devices.

I can think of a few ways you should be able to mitigate this:

  • You could reduce the system-call overhead of the Thrust memory allocations to zero by writing your own custom Thrust memory allocator for the device that works off a slab of memory allocated during initialization (see the sketch after this list). This would eliminate all of the system-call overhead within each sort_by_key, but the effort of writing your own user memory manager is non-trivial. On the other hand, it leaves the rest of your Thrust code intact.
  • You could switch to an alternative sorting library and take back the management of temporary memory allocation yourself. If you do all the allocation in an initialization phase, the cost of the one-time memory allocations can be amortized to almost zero over the life of each thread.
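To illustrate the first idea, here is a hedged sketch of a trivial slab allocator handed to Thrust through an execution policy. Whether thrust::cuda::par accepts an allocator this way depends on your Thrust version, and the names, the 256-byte alignment, and the reset-per-interval policy are illustrative assumptions rather than a definitive implementation.

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <new>

// Bump-pointer allocator over a slab that is cudaMalloc'd exactly once, at
// startup, so thrust::sort_by_key no longer allocates per call.
struct slab_allocator
{
    typedef char value_type;

    char*  base;
    size_t size;
    size_t offset;

    slab_allocator() : base(0), size(0), offset(0) {}

    void init(size_t bytes)
    {
        cudaMalloc((void**)&base, bytes);   // the only cudaMalloc, at initialization
        size = bytes;
    }

    char* allocate(std::ptrdiff_t n)
    {
        // round each request up to 256 bytes so returned pointers stay aligned
        size_t padded = (static_cast<size_t>(n) + 255) & ~static_cast<size_t>(255);
        if (offset + padded > size) throw std::bad_alloc();
        char* p = base + offset;
        offset += padded;                   // simple bump-pointer allocation
        return p;
    }

    void deallocate(char*, size_t) {}       // slab is reclaimed wholesale by reset()

    void reset() { offset = 0; }            // call between processing intervals
};

// Per-interval sort that draws its temporary storage from the slab.
void sort_interval(slab_allocator& alloc,
                   thrust::device_vector<unsigned int>& keys,
                   thrust::device_vector<int>&          values)
{
    alloc.reset();
    thrust::sort_by_key(thrust::cuda::par(alloc),
                        keys.begin(), keys.end(), values.begin());
}
```

The same slab idea generalizes to the pool-based device memory manager described below; the key point is that all cudaMalloc traffic moves to initialization.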

In the multi-GPU, CUBLAS-based linear algebra codes I wrote, I combined both ideas and wrote a standalone user-space device memory manager that works off a one-time allocated device memory pool. I found that removing all of the overhead of intermediate device memory allocations yielded a useful speed-up. Your use case may benefit from a similar strategy.


Source: https://habr.com/ru/post/955279/

