Malloc Function Efficiency in CUDA

I am porting some CPU code to CUDA. My CUDA card is based on the Fermi architecture, so I can call malloc() inside device code to allocate memory dynamically, which means I do not have to change the source much. (malloc() is called many times in my code.) My question is whether this device-side malloc() is efficient enough, or whether it should be avoided where possible. My CUDA code does not run very fast, and I suspect the use of malloc() may be the cause.
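For context, a minimal sketch of the pattern being described (not the asker's actual code): each thread calls malloc() from the device heap, which Fermi (compute capability 2.0+) allows. The kernel name and buffer size here are hypothetical.

```cuda
#include <cstdlib>

// Illustrative sketch: each thread allocates its own scratch buffer
// with device-side malloc(). Requires compute capability >= 2.0.
__global__ void scratch_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Per-thread allocation from the device heap (lives in global memory).
    float *tmp = (float *)malloc(16 * sizeof(float));
    if (tmp == NULL) return;  // device heap exhausted

    for (int j = 0; j < 16; ++j)
        tmp[j] = i * 0.5f + j;
    out[i] = tmp[15];

    free(tmp);  // memory allocated on the device must be freed on the device
}
```

Note that the device heap has a fixed default size (8 MB); it can be enlarged from the host with `cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes)` before the kernel launches, otherwise frequent allocations may start returning NULL.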

Please let me know if you have any suggestions or comments. I appreciate your help.

1 answer

The current implementation of device-side malloc is very slow (papers have been published on efficient dynamic memory allocation in CUDA, but AFAIK that work has not yet appeared in a toolkit release). The memory it allocates comes from a heap that lives in global memory, and accessing that memory is also very slow. Unless you have a very good reason to do otherwise, I would recommend avoiding dynamic memory allocation inside kernels entirely; it will hurt overall performance. Whether it is in fact what is slowing down your particular code is a completely separate question.
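The usual workaround is to move the allocation out of the kernel: reserve one buffer for all threads from the host with cudaMalloc and give each thread a fixed-size slice. A minimal sketch, with hypothetical names and a hypothetical per-thread scratch size of 16 floats:

```cuda
#include <cuda_runtime.h>

// Sketch of the pre-allocation pattern: no malloc()/free() in the kernel;
// each thread indexes into a slice of one large host-allocated buffer.
__global__ void scratch_kernel(float *scratch, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float *tmp = scratch + i * 16;  // this thread's 16-element slice
    for (int j = 0; j < 16; ++j)
        tmp[j] = i * 0.5f + j;
    out[i] = tmp[15];
}

int main()
{
    const int n = 1024;
    float *scratch, *out;

    // One allocation for all threads, done once on the host side.
    cudaMalloc(&scratch, n * 16 * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    scratch_kernel<<<(n + 255) / 256, 256>>>(scratch, out, n);
    cudaDeviceSynchronize();

    cudaFree(scratch);
    cudaFree(out);
    return 0;
}
```

This trades a little extra memory (every thread gets a slice whether it needs one or not) for the removal of all per-thread heap traffic, which is usually the right trade when the per-thread allocation size is bounded.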


Source: https://habr.com/ru/post/911306/
