How to efficiently shuffle data in device memory?

Question

When moving many scattered (non-coalesced) values within a device's global memory, what is the most efficient way to do it?

Note: "many" values, as in > 500.

Context

I have been working on a GPU implementation of a Genetic Algorithm for some time, trying to balance the flexibility of my infrastructure against optimizations for the GPU architecture. The GA data lives permanently in GPU memory; only copies of the best generations are transferred to host memory.

Detailed scenario

I am optimizing the migration function. This is where most of the data gets shuffled within the device's global memory. However, my data is laid out so that memory accesses are coalesced for the GA operator kernels, which means that swapping a pair of "genomes" amounts to moving single float values at a large stride and exchanging them with another genome's values at the same stride.

Known Solutions

The problem is not memory bandwidth, but kernel-launch latency and thread blocks that stall the pipeline.

  • I wrote a couple of device kernels whose only job is to move values between addresses. This launches a kernel (with very low occupancy, divergent code, and random memory access, so the little work it does is effectively serialized), but it completes the job in just two kernel calls:

    • The 1st kernel copies the values into a buffer array.
    • The 2nd kernel exchanges the values.
  • I know that I can use cudaMemcpy for each value, but this would require many cudaMemcpy calls, which I believe are synchronous.

Simplified code example:

```cuda
int needed_genome_idx = 0; // Some random index.
for (int nth_gene = 0; nth_gene < num_genes; ++nth_gene) {
    cudaMemcpy(genomes_buffer + nth_gene,
               src + needed_genome_idx + nth_gene * stride_size, // stride_size is a big number, usually the size of the GA population.
               sizeof(float),
               cudaMemcpyDeviceToDevice);
}
```

Is this a viable solution? Would using cudaMemcpyAsync help performance?

Is there a better way, or at least a more elegant way to do such memory operations?
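For reference, here is a hedged sketch of what the cudaMemcpyAsync variant might look like (the identifiers `genomes_buffer`, `src`, `num_genes`, and `stride_size` are taken from the example above; the stream setup is my assumption). Each call still carries per-call driver overhead, so this mainly overlaps the copies rather than eliminating the latency:

```cuda
// Sketch only: the same per-gene copies as above, but queued on a stream so
// the host does not block on each call. Assumes genomes_buffer, src,
// num_genes and stride_size are defined as in the question.
cudaStream_t stream;
cudaStreamCreate(&stream);

int needed_genome_idx = 0; // Some random index.
for (int nth_gene = 0; nth_gene < num_genes; ++nth_gene) {
    cudaMemcpyAsync(genomes_buffer + nth_gene,
                    src + needed_genome_idx + nth_gene * stride_size,
                    sizeof(float),
                    cudaMemcpyDeviceToDevice,
                    stream);
}
cudaStreamSynchronize(stream); // Wait once, after all copies are queued.
cudaStreamDestroy(stream);
```

Note that for > 500 values, the accumulated per-call overhead of hundreds of cudaMemcpyAsync calls is usually still worse than a single kernel launch.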

1 answer

You can try writing a kernel to perform the shuffle; it will likely be more efficient than calling cudaMemcpy so many times.
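As a minimal illustration of that suggestion (my own sketch, not code from the answer), such a kernel could swap two genomes in a single launch, one thread per gene, assuming the strided gene-major layout described in the question:

```cuda
// Sketch: swap two genomes stored in a strided (gene-major) layout,
// one thread per gene. `genomes` points to device memory; `idx_a` and
// `idx_b` are the two genome indices to exchange; `stride` is the
// population size, as in the question's stride_size.
__global__ void swap_genomes(float *genomes, int idx_a, int idx_b,
                             int num_genes, int stride)
{
    int gene = blockIdx.x * blockDim.x + threadIdx.x;
    if (gene < num_genes) {
        float tmp = genomes[idx_a + gene * stride];
        genomes[idx_a + gene * stride] = genomes[idx_b + gene * stride];
        genomes[idx_b + gene * stride] = tmp;
    }
}

// Launch example: one launch replaces hundreds of per-value cudaMemcpy calls.
// swap_genomes<<<(num_genes + 255) / 256, 256>>>(d_genomes, a, b,
//                                                num_genes, stride);
```

Occupancy is still low for a single pair, but the kernel can be extended to swap many pairs per launch (e.g. with a second grid dimension or an index array), amortizing the launch latency further.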


Source: https://habr.com/ru/post/952411/
