I am only superficially familiar with these Nvidia APIs, and some of the terminology is not clear to me. I was wondering if anyone could help me figure out when and how to use these CUDA calls in a simple way. More precisely:
While studying how to speed up some applications with parallel kernel execution (for example, with CUDA), at some point I ran into the problem of speeding up the Host-Device interaction. I have gathered some information online, but I am a little confused. It seems clear that you can go faster when you use cudaHostRegister() and/or cudaHostAlloc(). One explanation says that
"you can use cudaHostRegister() to take some data (already allocated) and pin it, avoiding an extra copy to get it onto the GPU."
What does "pinning the memory" mean? Why is it faster? And how do I pin memory in practice? Further on, the same video goes on to explain that
"if you transfer from PINNED memory, you can use the asynchronous transfer cudaMemcpyAsync(), which lets the CPU keep working during the memory transfer."
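From what I have gathered so far, the usage pattern being described would look roughly like this (a sketch based on my reading of the CUDA runtime docs, not code I have been able to test):

```cuda
#include <cuda_runtime.h>

int main(void) {
    const size_t n = 1 << 20;
    float *h_buf, *d_buf;

    // Allocate pinned (page-locked) host memory directly;
    // alternatively, cudaHostRegister() would pin an existing malloc'ed buffer.
    cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous copy: the call returns immediately, and (I assume)
    // the transfer proceeds in the background while the CPU keeps working.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... CPU is free to do other work here ...

    cudaStreamSynchronize(stream);   // block until the transfer has finished

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Please correct me if this is not the intended pattern, because this is exactly where my confusion about who drives the transfer begins: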
Is the PCIe transaction carried out entirely by the CPU? Or is there some bus manager (a DMA engine, perhaps?) that takes care of it? Partial answers are also really appreciated, so I can piece the puzzle together at the end.
A reference to the equivalent APIs in OpenCL would also be very welcome.