There really is little information about this, but you are overestimating the effect.
Kernel code is uploaded to the GPU at most once per kernel launch (and, judging by the docs quoted below, really just once per application run), after which it executes entirely on the GPU without CPU intervention. So the entire kernel code is copied over in a single piece at some point before the kernel is launched. To get a feel for the code size: the .cubin holding all of the GPU code of our in-house MD package (52 kernels, some of them >150 lines of code) is only 91 KiB, so we can safely assume that the code transfer time is negligible in virtually all cases.
Here is the information I found in the official docs:
In the CUDA driver API, the code is loaded onto the device at the time you call the cuModuleLoad function:

The CUDA driver API does not attempt to lazily allocate the resources needed by a module; if the memory for functions and data (constant and global) needed by the module cannot be allocated, cuModuleLoad() fails
In theory you may have to unload a module and then load it again if you have several modules that use too much constant (or statically allocated global) memory to be loaded simultaneously, but that is quite uncommon; you usually call cuModuleLoad only once per application launch, right after creating the context.
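For illustration, here is a minimal sketch of that usual driver-API pattern (my own example, not from the docs): create the context, load the module once, then launch kernels from it as often as needed. The file name `kernels.cubin` and the kernel name `myKernel` are placeholders, and error handling is reduced to a single macro.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define CHECK(call) do { \
    CUresult err = (call); \
    if (err != CUDA_SUCCESS) { \
        fprintf(stderr, "CUDA error %d at %s:%d\n", err, __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    /* The code is copied to the device here, once, right after context
     * creation -- and the call fails immediately if the module's
     * constant/global memory cannot be allocated. */
    CHECK(cuModuleLoad(&mod, "kernels.cubin"));
    CHECK(cuModuleGetFunction(&fn, mod, "myKernel"));

    /* ... launch fn with cuLaunchKernel as many times as needed;
     * no further code transfer takes place ... */

    CHECK(cuModuleUnload(mod));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```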
The CUDA runtime API does not expose any module loading/unloading facilities, but it looks like all the necessary code is loaded onto the device during its initialization.
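Since the runtime API initializes lazily on the first CUDA call, one way to see this one-time cost is simply to time that first call. A minimal sketch, assuming a POSIX clock; `cudaFree(0)` is just the usual idiom for forcing initialization:

```c
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void) {
    double t0 = now_ms();
    cudaFree(0);   /* first call: pays the one-time context/code-load cost */
    double t1 = now_ms();
    cudaFree(0);   /* second call: nothing left to initialize or load */
    double t2 = now_ms();
    printf("first call: %.2f ms, second call: %.2f ms\n", t1 - t0, t2 - t1);
    return 0;
}
```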
The OpenCL specifications are less explicit than the CUDA API docs, but the code is most likely copied to the device during the clBuildProgram stage.
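A hedged sketch of the corresponding OpenCL host code (placeholder no-op kernel; error handling omitted for brevity), with the presumed transfer point marked:

```c
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    const char *src = "__kernel void noop(void) { }";  /* placeholder kernel */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);

    /* Compilation -- and, most likely, the transfer of the device
     * code -- happens here, once per program object. */
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

    /* ... create kernels with clCreateKernel and enqueue them with
     * clEnqueueNDRangeKernel; no further code transfer is expected ... */

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}
```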