NVRTC and __device__ Functions

I am trying to optimize my simulator using run-time compilation. My code is quite long and complex, but I have identified a specific __device__ function whose performance can be significantly improved by removing all global memory accesses.

Does CUDA allow dynamic compilation and linking of a single __device__ function (not a __global__ kernel) in order to "override" an existing function?

2 answers

I am fairly sure that the really short answer is no.

Although CUDA has support for dynamic / JIT device linking, it is important to remember that the linking process itself is still static.

Thus, you cannot defer loading a specific function into an already compiled GPU payload at runtime, as you could in a normal dynamic link loading environment. The linker still requires that, at link time, there is exactly one instance of every code object and symbol, whether linking is done ahead of time or at run time. So you are free to JIT link precompiled objects containing different versions of the same code, as long as exactly one instance of everything is present when the link session is finalized and the code is loaded into the context. But that is as far as you can go.

+2
source

It looks like you have a "main" kernel with a part that is "switchable" at runtime.

You can do this using nvrtc. You will need to do something like this:

  • Instead of compiling the main kernel ahead of time, store it as a string that is compiled and linked at run time.
  • Suppose the main kernel calls "myFunc", a __device__ function that is chosen at runtime.
  • You can generate the source of the corresponding "myFunc" device function from your equations at runtime.
  • Now you can create the nvrtc program from several sources using nvrtcCreateProgram (one main source string plus additional sources passed as named headers).
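
The steps above can be sketched as follows. This is a minimal illustration, not the answerer's code: the names mainKernel, makeMyFunc, compileToPtx, the header name my_func.cuh and the USE_NVRTC macro are all assumptions made for the example. The NVRTC part is only compiled when USE_NVRTC is defined (and the program is linked against nvrtc).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#ifdef USE_NVRTC
#include <nvrtc.h>
#endif

// Main kernel kept as a string. It includes a header that is supplied
// from memory at run time (hypothetical name "my_func.cuh").
const char* const kMainSrc = R"(
#include "my_func.cuh"
extern "C" __global__ void mainKernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = myFunc(in[i]);
}
)";

// Hypothetical generator: wraps an expression in x in a __device__ function.
std::string makeMyFunc(const std::string& expr) {
    return "__device__ float myFunc(float x) { return " + expr + "; }\n";
}

#ifdef USE_NVRTC
// Compiles the main kernel together with the generated myFunc source.
// Returns the PTX text, or an empty string on failure.
std::string compileToPtx(const std::string& myFuncSrc) {
    const char* headers[] = { myFuncSrc.c_str() };
    const char* includeNames[] = { "my_func.cuh" };
    nvrtcProgram prog;
    if (nvrtcCreateProgram(&prog, kMainSrc, "main.cu", 1, headers,
                           includeNames) != NVRTC_SUCCESS) {
        return "";
    }
    std::string ptx;
    if (nvrtcCompileProgram(prog, 0, nullptr) == NVRTC_SUCCESS) {
        std::size_t size = 0;
        nvrtcGetPTXSize(prog, &size);  // size includes the trailing NUL
        ptx.resize(size);
        nvrtcGetPTX(prog, &ptx[0]);
        if (!ptx.empty() && ptx.back() == '\0') ptx.pop_back();
    }
    nvrtcDestroyProgram(&prog);
    return ptx;
}
#endif
```

With this layout, only the string returned by makeMyFunc changes between variants, while the main kernel source stays fixed.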

That is about it. The key is to delay compiling the main kernel until it is needed at runtime. You can also cache your compiled kernels in some way so that each one is compiled only once.
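
One possible way to do the caching is to key compiled results by the generated source text. The sketch below is an assumption about how this could look, not something prescribed by nvrtc: KernelCache and its compile callback are hypothetical names, and the callback stands in for whatever function turns source into PTX.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Caches compiled output (e.g. PTX text) keyed by the source string,
// so that identical generated sources are compiled only once.
class KernelCache {
public:
    using Compiler = std::function<std::string(const std::string&)>;

    explicit KernelCache(Compiler compile) : compile_(std::move(compile)) {}

    // Returns the cached result, compiling on the first request only.
    const std::string& get(const std::string& source) {
        auto it = cache_.find(source);
        if (it == cache_.end()) {
            ++misses_;
            it = cache_.emplace(source, compile_(source)).first;
        }
        return it->second;
    }

    // Number of times the compile callback was actually invoked.
    int misses() const { return misses_; }

private:
    Compiler compile_;
    std::unordered_map<std::string, std::string> cache_;
    int misses_ = 0;
};
```

In a real program the callback would invoke the run-time compiler; here any function from source text to compiled text works, which also makes the cache easy to test without a GPU.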

There is one problem that I foresee. nvrtc cannot find curand device calls, which may cause some problems. One workaround would be to look at the header that contains the device function call and use nvcc to compile the corresponding device function into ptx. You can store the resulting ptx as text and use cuLinkAddData to link it with your module. You can find more information in this section.
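
A rough sketch of that linking step with the CUDA driver API is shown below. It assumes a CUDA context is already current and that every input is PTX text; PtxInput, linkAndLoad and the USE_CUDA_DRIVER macro are hypothetical names for this example. As far as I know, code that refers to a device function defined in another input also has to be compiled as relocatable device code (for nvrtc, the --relocatable-device-code=true option) for the link to resolve it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>
#ifdef USE_CUDA_DRIVER
#include <cuda.h>
#endif

// One PTX input for the linker; `name` only appears in linker diagnostics.
// `ptx` holds the text without a trailing NUL character.
struct PtxInput {
    std::string name;
    std::string ptx;
    // Size passed to cuLinkAddData, counting the terminating NUL
    // (the convention used in NVIDIA's ptxjit sample).
    std::size_t sizeWithNul() const { return ptx.size() + 1; }
};

#ifdef USE_CUDA_DRIVER
// Links all PTX inputs and loads the result as a module.
// Returns false if any driver call fails.
bool linkAndLoad(const std::vector<PtxInput>& inputs, CUmodule* module) {
    CUlinkState state;
    if (cuLinkCreate(0, nullptr, nullptr, &state) != CUDA_SUCCESS) return false;
    bool ok = true;
    for (const PtxInput& in : inputs) {
        ok = ok && cuLinkAddData(state, CU_JIT_INPUT_PTX,
                                 const_cast<char*>(in.ptx.c_str()),
                                 in.sizeWithNul(), in.name.c_str(),
                                 0, nullptr, nullptr) == CUDA_SUCCESS;
    }
    void* cubin = nullptr;
    std::size_t cubinSize = 0;
    ok = ok && cuLinkComplete(state, &cubin, &cubinSize) == CUDA_SUCCESS;
    // The linked image is owned by the link state, so load it before destroying.
    ok = ok && cuModuleLoadData(module, cubin) == CUDA_SUCCESS;
    cuLinkDestroy(state);
    return ok;
}
#endif
```

The inputs would typically be the PTX produced at run time for the main kernel plus the PTX generated ahead of time with nvcc for the device function.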


Source: https://habr.com/ru/post/1258341/

