There really is little information about this, but you are overestimating the effect.
Kernel code is uploaded to the GPU at most once per kernel launch (and, judging by the docs quoted below, really just once per application run), after which it executes entirely on the GPU without CPU intervention. So the entire kernel code is copied over in a single piece at some point before the kernel is launched. To get a feel for the code size: the .cubin holding all of the GPU code of our in-house MD package (52 kernels, some of them >150 lines of code) is only 91 KiB, so we can safely assume that the code transfer time is negligible in virtually all cases.
Here is the information I found in the official docs:
In the CUDA driver API, the code is loaded onto the device at the time you call the cuModuleLoad function:

The CUDA driver API does not attempt to lazily allocate the resources needed by a module; if the memory for functions and data (constant and global) needed by the module cannot be allocated, cuModuleLoad() fails
In theory you may have to unload a module and then load it again if you have several modules that use too much constant (or statically allocated global) memory to be loaded simultaneously, but that is quite uncommon; you usually call cuModuleLoad only once per application launch, right after creating the context.
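For illustration, here is a minimal sketch of that usual driver-API pattern (my own example, not from the docs): create the context, load the module once, then launch kernels from it as often as needed. The file name `kernels.cubin` and the kernel name `myKernel` are placeholders, and error handling is reduced to a single macro.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define CHECK(call) do { \
    CUresult err = (call); \
    if (err != CUDA_SUCCESS) { \
        fprintf(stderr, "CUDA error %d at %s:%d\n", err, __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    /* The code is copied to the device here, once, right after context
     * creation -- and the call fails immediately if the module's
     * constant/global memory cannot be allocated. */
    CHECK(cuModuleLoad(&mod, "kernels.cubin"));
    CHECK(cuModuleGetFunction(&fn, mod, "myKernel"));

    /* ... launch fn with cuLaunchKernel as many times as needed;
     * no further code transfer takes place ... */

    CHECK(cuModuleUnload(mod));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```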
The CUDA runtime API does not expose any module loading/unloading facilities, but it looks like all the necessary code is loaded onto the device during its initialization.
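Since the runtime API initializes lazily on the first CUDA call, one way to see this one-time cost is simply to time that first call. A minimal sketch, assuming a POSIX clock; `cudaFree(0)` is just the usual idiom for forcing initialization:

```c
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void) {
    double t0 = now_ms();
    cudaFree(0);   /* first call: pays the one-time context/code-load cost */
    double t1 = now_ms();
    cudaFree(0);   /* second call: nothing left to initialize or load */
    double t2 = now_ms();
    printf("first call: %.2f ms, second call: %.2f ms\n", t1 - t0, t2 - t1);
    return 0;
}
```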
The OpenCL specifications are less explicit than the CUDA API docs, but the code is most likely copied to the device during the clBuildProgram stage.
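A hedged sketch of the corresponding OpenCL host code (placeholder no-op kernel; error handling omitted for brevity), with the presumed transfer point marked:

```c
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    const char *src = "__kernel void noop(void) { }";  /* placeholder kernel */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);

    /* Compilation -- and, most likely, the transfer of the device
     * code -- happens here, once per program object. */
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

    /* ... create kernels with clCreateKernel and enqueue them with
     * clEnqueueNDRangeKernel; no further code transfer is expected ... */

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}
```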