Question
What is the most efficient way to move many scattered (non-coalesced) values within a device's global memory?
Note: many values, as in > 500.
Context
I have been working on a GPU implementation of a Genetic Algorithm for some time now, trying to balance the flexibility of my infrastructure against optimizations for the GPU architecture. The GA data resides permanently in GPU memory; only copies of the best individuals of each generation are copied back to host memory.
Detailed scenario
I am optimizing the migration function. Here, most of the data is shuffled within the device's global memory. The data is laid out so that memory accesses in the GA operator kernels are coalesced, which forces me to move a couple of "genomes" one single-precision float at a time, swapping each value with the corresponding value of another genome in the same step.
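To make the layout concrete, here is a minimal host-side sketch. The gene-major layout and all names are my assumptions, not the actual code: with each gene stored contiguously across the population, the GA operator kernels get coalesced reads, but the values of a single genome are strided, so migrating a genome means touching scattered addresses.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical gene-major layout: gene i of genome g lives at i * pop_size + g.
// Operator kernels reading gene i for consecutive genomes access consecutive
// addresses (coalesced), but one genome's values are strided by pop_size.
int gene_index(int genome, int gene, int pop_size) {
    return gene * pop_size + genome;
}

// Swapping two genomes one float at a time, as during migration.
void swap_genomes(std::vector<float>& pop, int a, int b,
                  int pop_size, int genome_len) {
    for (int i = 0; i < genome_len; ++i)
        std::swap(pop[gene_index(a, i, pop_size)],
                  pop[gene_index(b, i, pop_size)]);
}
```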
Known Solutions
The problem is not memory bandwidth, but the latency of the calls and the blocking that stalls the pipeline.
I wrote a couple of device kernels whose only job is to move values between addresses. Each launch has very low occupancy, divergent code, and random memory accesses, so the little work it does is effectively serialized, but the whole operation needs only two kernel calls:
- The 1st kernel copies the values into a buffer array.
- The 2nd kernel swaps the values back into place.
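The two-kernel approach might be sketched like this (a hedged sketch, not my actual implementation: the gene-major indexing, kernel names, and parameters are all assumptions):

```cuda
// Hypothetical sketch of the two-kernel shuffle, assuming gene i of
// genome g lives at i * pop_size + g. Consecutive threads read addresses
// pop_size apart, hence the non-coalesced access pattern.

__global__ void gather_genome(float *buf, const float *pop,
                              int src, int pop_size, int genome_len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < genome_len)
        buf[i] = pop[i * pop_size + src];   // strided read into buffer
}

__global__ void swap_genome(float *pop, const float *buf,
                            int src, int dst, int pop_size, int genome_len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < genome_len) {
        pop[i * pop_size + src] = pop[i * pop_size + dst];
        pop[i * pop_size + dst] = buf[i];   // complete the swap
    }
}
```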
I know I could instead use cudaMemcpy for each value, but that would mean many cudaMemcpy calls, which I believe are synchronous.
Simplified code example:
int needed_genome_idx = 0;
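The snippet above is truncated; the per-value cudaMemcpy approach I am referring to would look roughly like this (every name here is a hypothetical stand-in, not my actual code):

```cuda
// Hypothetical per-value copy loop. Each cudaMemcpy moves a single float
// device-to-device, so the loop pays one blocking API call per value.
for (int i = 0; i < genome_len; ++i) {
    cudaMemcpy(&d_pop[dst_offset + i * pop_size],
               &d_pop[src_offset + i * pop_size],
               sizeof(float),
               cudaMemcpyDeviceToDevice);
}
```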
Is this a viable solution? Would using cudaMemcpyAsync help performance?
Is there a better, or at least more elegant, way to perform such memory operations?
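For reference, the cudaMemcpyAsync variant I am asking about would look roughly like this (stream handling and all names are assumptions on my part):

```cuda
// Hypothetical async variant: copies are queued on one stream and the host
// blocks only once at the end. The per-call overhead of issuing many tiny
// device-to-device copies remains, which is part of my question.
cudaStream_t stream;
cudaStreamCreate(&stream);
for (int i = 0; i < genome_len; ++i) {
    cudaMemcpyAsync(&d_pop[dst_offset + i * pop_size],
                    &d_pop[src_offset + i * pop_size],
                    sizeof(float),
                    cudaMemcpyDeviceToDevice,
                    stream);
}
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);
```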