I am trying to port some CPU code to CUDA. My CUDA card is based on the Fermi architecture, so I can call malloc() on the device to allocate memory dynamically and do not have to change the original code much. (malloc() is called many times in my code.) My question is whether device-side malloc() is efficient enough, or whether it should be avoided where possible. My CUDA code does not run very fast, and I suspect this is caused by the use of malloc().
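To make the question concrete, here is a minimal sketch of the pattern I mean, next to the alternative I am considering. The kernel names, the sizes, and the toy workload are made up for illustration and are not my actual code. The heap size passed to cudaDeviceSetLimit is also just an assumption that leaves room for every thread's buffer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread allocates its own scratch buffer with
// in-kernel malloc(), which is the pattern the question is about.
__global__ void scratch_with_malloc(float *out, int n, int len)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float *scratch = (float *)malloc(len * sizeof(float));
    if (scratch == NULL) { out[tid] = -1.0f; return; }  // device heap exhausted

    float sum = 0.0f;
    for (int i = 0; i < len; ++i) {
        scratch[i] = (float)(tid + i);
        sum += scratch[i];
    }
    out[tid] = sum;
    free(scratch);
}

// Hypothetical alternative: one cudaMalloc() on the host, and every thread
// works on its own slice of that pool.
__global__ void scratch_preallocated(float *out, float *pool, int n, int len)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float *scratch = pool + (size_t)tid * len;
    float sum = 0.0f;
    for (int i = 0; i < len; ++i) {
        scratch[i] = (float)(tid + i);
        sum += scratch[i];
    }
    out[tid] = sum;
}

int main()
{
    const int n = 1 << 14, len = 64, threads = 128;
    const int blocks = (n + threads - 1) / threads;

    // Assumed heap size; must be set before the first kernel that calls malloc().
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, (size_t)64 << 20);

    float *out = NULL, *pool = NULL;
    cudaMalloc((void **)&out, n * sizeof(float));
    cudaMalloc((void **)&pool, (size_t)n * len * sizeof(float));

    // Warm-up launches so one-time startup cost is not included in the timing.
    scratch_with_malloc<<<blocks, threads>>>(out, n, len);
    scratch_preallocated<<<blocks, threads>>>(out, pool, n, len);
    cudaDeviceSynchronize();

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0);
    scratch_with_malloc<<<blocks, threads>>>(out, n, len);
    cudaEventRecord(t1);
    scratch_preallocated<<<blocks, threads>>>(out, pool, n, len);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float ms_malloc = 0.0f, ms_pool = 0.0f;
    cudaEventElapsedTime(&ms_malloc, t0, t1);
    cudaEventElapsedTime(&ms_pool, t1, t2);
    printf("in-kernel malloc: %.3f ms, preallocated pool: %.3f ms\n",
           ms_malloc, ms_pool);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaFree(pool);
    cudaFree(out);
    return 0;
}
```

The second kernel only works when the scratch size per thread is known before launch, which is not always the case in my code, so I would like to know how much the first form really costs.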
Please let me know if you have any suggestions or comments. I appreciate your help.