I allocate several floating-point arrays (quite large, i.e. 9,000,000 elements each) on the GPU using cudaMalloc((void**)&(storage->data), size * sizeof(float)). At the end of my program, I free this memory using cudaFree(storage->data);.
The problem is that the first deallocation is very slow, about 10 seconds, while the others are almost instantaneous.
My question is this: what can cause this difference? Is deallocating memory on the GPU generally slow?
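One way to narrow this down is to time each cudaFree call separately, after an explicit cudaDeviceSynchronize(). Kernel launches are asynchronous, so if the first free is really absorbing a wait for GPU work that is still running, the delay should move to the synchronize call instead. The following is only a minimal sketch under assumptions: the array count (kArrays), the timedFree helper, and the plain float* array standing in for storage->data are all hypothetical and not taken from the actual program.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock duration of a single cudaFree call, in ms.
static double timedFree(float* p) {
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(p);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess)
        std::fprintf(stderr, "cudaFree failed: %s\n", cudaGetErrorString(err));
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t size = 9000000;  // elements per array, as in the question
    const int kArrays = 4;        // assumption: the real number of arrays is unknown
    float* data[kArrays] = {};    // stands in for storage->data

    for (int i = 0; i < kArrays; ++i) {
        cudaError_t err = cudaMalloc((void**)&data[i], size * sizeof(float));
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
    }

    // ... the program's kernels would be launched here ...

    // Wait for all outstanding GPU work and time that wait on its own,
    // so it is not attributed to the first cudaFree.
    auto t0 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    std::printf("synchronize: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());

    for (int i = 0; i < kArrays; ++i)
        std::printf("cudaFree #%d: %.3f ms\n", i, timedFree(data[i]));

    return 0;
}
```

If the synchronize line reports the roughly 10 seconds and every cudaFree is then fast, the cost is in pending kernel work rather than in deallocation itself. If the first cudaFree is still slow after the synchronize, the cause lies elsewhere.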