How to mitigate host + device memory transfer bottlenecks in OpenCL / CUDA

If my algorithm is bottlenecked by host-to-device and device-to-host memory transfers, is the only solution a different or modified algorithm?

+3
2 answers

There are a few things you can try to mitigate the PCIe bottleneck:

  • Asynchronous transfers - let you overlap computation with bulk transfers.
  • Mapped memory - lets a kernel stream data to / from the GPU while it is running.

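Note, however, that neither technique makes the transfer itself any faster; they only reduce the time the GPU spends waiting for data.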

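With the cudaMemcpyAsync API you can start a transfer, launch one or more kernels that do not depend on its result, synchronize the host and device, and then launch the kernels that were waiting for the data. If you can structure your algorithm so that it does useful work while the transfer is in flight, asynchronous copies are a good solution.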
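A minimal sketch of overlapping transfers with computation, assuming the work can be split into independent chunks. The kernel `scale`, the array size, and the stream count are hypothetical and chosen only for illustration. Each chunk gets its own stream, so the copy for one chunk can overlap with the kernel for another on hardware that supports it. The host buffer is page-locked, which cudaMemcpyAsync needs in order to run truly asynchronously.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel: multiplies each element by a constant in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;             // total elements (illustrative)
    const int nStreams = 4;            // number of chunks / streams
    const int chunk = n / nStreams;    // n is divisible by nStreams here
    const size_t chunkBytes = chunk * sizeof(float);

    float *h_data = NULL, *d_data = NULL;
    // Page-locked (pinned) host memory, required for asynchronous copies.
    CHECK(cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&d_data, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Queue copy-in, kernel, copy-out per chunk. Operations within one
    // stream run in order; operations in different streams may overlap.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        CHECK(cudaMemcpyAsync(d_data + offset, h_data + offset, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_data + offset, d_data + offset, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }

    // Wait for all queued work before reading the results on the host.
    CHECK(cudaDeviceSynchronize());

    int errors = 0;
    for (int i = 0; i < n; ++i) if (h_data[i] != 2.0f) ++errors;
    printf("mismatches: %d\n", errors);

    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d_data));
    CHECK(cudaFreeHost(h_data));
    return errors == 0 ? 0 : 1;
}
```

Whether the copies and kernels actually overlap depends on the device's copy engines and on the chunk size, so it is worth confirming with a profiler rather than assuming it.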

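With the cudaHostAlloc API you can allocate host memory that the GPU can read and write directly. These accesses cross the PCIe bus, so they are slower than accesses to device memory, but they avoid a separate copy step before the kernel can start. Mapped memory works best when each value is read or written only once, because repeated accesses pay the bus cost every time.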
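A minimal sketch of mapped (zero-copy) memory, assuming a device that reports `canMapHostMemory`. The kernel `increment` and the buffer size are hypothetical. The host buffer is allocated with the cudaHostAllocMapped flag, and cudaHostGetDevicePointer returns the address the kernel uses to reach it; there is no explicit cudaMemcpy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel: reads and writes host memory through the mapping.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    // Request host-memory mapping before any other runtime call
    // creates a context (needed on older CUDA versions).
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    if (!prop.canMapHostMemory) {
        printf("device 0 cannot map host memory\n");
        return 0;
    }

    const int n = 1 << 16;  // illustrative size
    int *h_data = NULL, *d_alias = NULL;

    // Page-locked host buffer that is also mapped into the device address space.
    CHECK(cudaHostAlloc((void **)&h_data, n * sizeof(int), cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) h_data[i] = i;

    // Device-side pointer that refers to the same host buffer.
    CHECK(cudaHostGetDevicePointer((void **)&d_alias, h_data, 0));

    increment<<<(n + 255) / 256, 256>>>(d_alias, n);
    CHECK(cudaGetLastError());

    // The kernel writes straight to host memory; synchronize before reading.
    CHECK(cudaDeviceSynchronize());

    int errors = 0;
    for (int i = 0; i < n; ++i) if (h_data[i] != i + 1) ++errors;
    printf("mismatches: %d\n", errors);

    CHECK(cudaFreeHost(h_data));
    return errors == 0 ? 0 : 1;
}
```

Each element here is read once and written once, which is the access pattern where mapped memory tends to pay off; a kernel that revisits the same data many times would usually be better served by copying it into device memory first.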

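Sections 3.2.6-3.2.7 of the CUDA Programming Guide 3.1 describe how to do this in CUDA. Chapter 3 of the OpenCL Best Practices Guide describes how to do it in OpenCL.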

+5

, , , . , - , .

+3

Source: https://habr.com/ru/post/1770393/

