I think the reason malloc() slows down your code is that it allocates memory in global memory. When you use a fixed-size array, the compiler will most likely put it in the register file, which is much faster.
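To make the difference concrete, here is a minimal sketch of the two variants side by side. The kernel names, `N`, and `THREADS` are made up for illustration and are not from your code. Note that the fixed-size array only has a chance of staying in registers when the compiler can resolve the indexing at compile time (for example after unrolling); otherwise it ends up in local memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

constexpr int N = 8;         // hypothetical per-thread scratch size
constexpr int THREADS = 64;  // hypothetical launch size

// Variant A: scratch space comes from the device heap, which lives in global memory.
__global__ void sum_with_malloc(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *scratch = static_cast<float *>(malloc(N * sizeof(float)));
    if (scratch == nullptr) { out[tid] = -1.0f; return; }  // device heap exhausted
    for (int i = 0; i < N; ++i) scratch[i] = static_cast<float>(tid + i);
    float s = 0.0f;
    for (int i = 0; i < N; ++i) s += scratch[i];
    out[tid] = s;
    free(scratch);
}

// Variant B: fixed-size array with compile-time bounds; once the loops are
// unrolled, the compiler is free to keep the elements in registers.
__global__ void sum_with_array(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float scratch[N];
    #pragma unroll
    for (int i = 0; i < N; ++i) scratch[i] = static_cast<float>(tid + i);
    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < N; ++i) s += scratch[i];
    out[tid] = s;
}

// Runs both kernels and counts the threads whose results differ.
int count_mismatches() {
    float *d_a = nullptr, *d_b = nullptr;
    float h_a[THREADS], h_b[THREADS];
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_b, sizeof(h_b));
    sum_with_malloc<<<1, THREADS>>>(d_a);
    sum_with_array<<<1, THREADS>>>(d_b);
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    int bad = 0;
    for (int t = 0; t < THREADS; ++t) {
        if (h_a[t] != h_b[t]) ++bad;
    }
    return bad;
}

int main() {
    printf("mismatches: %d\n", count_mismatches());
    return 0;
}
```

Both kernels compute the same sums, so timing them against each other (for example with a profiler) should show how much the in-kernel allocation costs on your hardware.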
Having malloc() inside your kernel may mean that you are trying to do too much work with a single kernel. If each thread allocates a different amount of memory, then each thread runs the for loop a different number of times, and you get a lot of warp divergence.
If every thread in a warp runs the loop the same number of times, just allocate the memory up front. Even if they run it a different number of times, you can still use a constant size that covers the largest case. Better still, I think you should look at how you can reorganize your code to remove this loop from your kernel completely.
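As a rough sketch of the "allocate up front with a constant size" idea: the host allocates one pool with `cudaMalloc`, sized for the worst case, and each thread uses its own slice. Everything here (`MAX_ITEMS`, `counts`, `pool`, the `process` kernel, the interleaved layout) is an assumption for illustration, not taken from your code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int MAX_ITEMS = 16;   // assumed upper bound on items per thread
constexpr int N_THREADS = 128;  // hypothetical total thread count

// Each thread uses a slice of a pool that the host allocated once.
// Element i of thread tid sits at pool[i * n_threads + tid], so
// neighbouring threads touch neighbouring addresses.
__global__ void process(const int *counts, float *pool, float *out, int n_threads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_threads) return;
    int n = counts[tid];  // may differ per thread, but never exceeds MAX_ITEMS
    for (int i = 0; i < n; ++i) pool[i * n_threads + tid] = 1.0f;
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += pool[i * n_threads + tid];
    out[tid] = s;
}

// Launches the kernel and counts threads whose sum differs from their item count.
int count_wrong_sums() {
    std::vector<int> h_counts(N_THREADS);
    for (int t = 0; t < N_THREADS; ++t) h_counts[t] = t % MAX_ITEMS + 1;

    int *d_counts = nullptr;
    float *d_pool = nullptr, *d_out = nullptr;
    cudaMalloc(&d_counts, N_THREADS * sizeof(int));
    cudaMalloc(&d_pool, MAX_ITEMS * N_THREADS * sizeof(float));  // one allocation, worst-case size
    cudaMalloc(&d_out, N_THREADS * sizeof(float));
    cudaMemcpy(d_counts, h_counts.data(), N_THREADS * sizeof(int), cudaMemcpyHostToDevice);

    process<<<N_THREADS / 64, 64>>>(d_counts, d_pool, d_out, N_THREADS);

    std::vector<float> h_out(N_THREADS);
    cudaMemcpy(h_out.data(), d_out, N_THREADS * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_counts);
    cudaFree(d_pool);
    cudaFree(d_out);

    int bad = 0;
    for (int t = 0; t < N_THREADS; ++t) {
        if (h_out[t] != static_cast<float>(h_counts[t])) ++bad;
    }
    return bad;
}

int main() {
    printf("wrong sums: %d\n", count_wrong_sums());
    return 0;
}
```

This removes the in-kernel allocation but not the divergence: threads with different `counts` still run the loop a different number of times, which is why restructuring the work so the loop disappears is likely the better fix.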