Allocating memory inside a CUDA kernel

I have the following kernel (fragment):

    __global__ void plain(int* geneVec, float* probs, int* nComponents,
                          float* randomNumbers, int* nGenes)
    {
        int xid = threadIdx.x + (blockDim.x * blockIdx.x);
        float* currentProbs = (float*)malloc(sizeof(float) * tmp);
        .....
        .....
        currentProbs[0] = probs[start];
        for (k = 1; k < nComponents[0]; k++) {
            currentProbs[k] = currentProbs[k - 1] + prob;
        }
        ...
        ...
        free(currentProbs);
    }

When currentProbs has a static size (even the very same size), the kernel is very fast, but with the dynamic allocation shown above its performance is terrible.
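For reference, here is a minimal sketch of what the fast fixed-size variant might look like (MAX_COMPONENTS, the loop body, and the final write are illustrative assumptions, not the original code):

    #include <cuda_runtime.h>

    // Assumed compile-time upper bound on nComponents[0] (hypothetical).
    #define MAX_COMPONENTS 32

    __global__ void plain_static(int* geneVec, float* probs, int* nComponents,
                                 float* randomNumbers, int* nGenes)
    {
        int xid = threadIdx.x + (blockDim.x * blockIdx.x);

        // A fixed-size per-thread array: the compiler can place it in
        // registers or local memory, so the device-heap allocator is
        // never touched.
        float currentProbs[MAX_COMPONENTS];

        currentProbs[0] = probs[0];
        for (int k = 1; k < nComponents[0] && k < MAX_COMPONENTS; k++) {
            currentProbs[k] = currentProbs[k - 1] + probs[k];
        }

        // Illustrative use so the array is not optimized away.
        if (xid < *nGenes)
            randomNumbers[xid] = currentProbs[0];
    }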

According to this question, I can do this inside a kernel: CUDA allocate memory in __device__ function

Here is a related question: Efficiency of malloc function in CUDA

I was wondering whether any other methods have been devised besides those proposed in that paper? It seems ridiculous that you cannot malloc/free inside a kernel without such a penalty.

1 answer

I think the reason malloc() slows down your code is that it allocates memory in global memory. When you use a fixed-size array, the compiler will most likely put it in the register file, which is much faster.
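As background, in-kernel malloc() is served from a device heap in global memory whose size (8 MB by default) is configured from the host; a minimal sketch of querying and raising it:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        // Query the current device-heap size used by in-kernel malloc().
        size_t heapSize = 0;
        cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
        printf("device malloc heap: %zu bytes\n", heapSize);

        // Raise it before launching kernels that allocate heavily.
        cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64u << 20);
        return 0;
    }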

Having a malloc inside your kernel may also mean that you are trying to do too much work in a single kernel. If each thread allocates a different amount of memory, then each thread iterates its for loop a different number of times, and you get a lot of warp divergence.

If every thread in a warp loops the same number of times, just allocate up front, as sketched below. Even if they loop a different number of times, you can use a constant-size allocation. But really, I think you should look at how to refactor your code to remove that loop from your kernel entirely.
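A minimal sketch of the "allocate up front" idea, assuming a known per-thread bound (the names plain_prealloc, workspace, maxPerThread, and launch are hypothetical, not from the answer):

    #include <cuda_runtime.h>

    __global__ void plain_prealloc(float* workspace, int maxPerThread,
                                   float* probs, int* nComponents)
    {
        int xid = threadIdx.x + (blockDim.x * blockIdx.x);

        // Each thread takes its own slice of one big buffer allocated
        // once on the host, instead of calling malloc() in the kernel.
        float* currentProbs = workspace + (size_t)xid * maxPerThread;

        currentProbs[0] = probs[0];
        for (int k = 1; k < nComponents[0] && k < maxPerThread; k++)
            currentProbs[k] = currentProbs[k - 1] + probs[k];
    }

    // Host side: one allocation sized for all threads.
    void launch(int nBlocks, int nThreads, int maxPerThread,
                float* d_probs, int* d_nComponents)
    {
        float* d_workspace;
        cudaMalloc(&d_workspace,
                   (size_t)nBlocks * nThreads * maxPerThread * sizeof(float));
        plain_prealloc<<<nBlocks, nThreads>>>(d_workspace, maxPerThread,
                                              d_probs, d_nComponents);
        cudaDeviceSynchronize();
        cudaFree(d_workspace);
    }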

