I want to understand how the CUDA context is created and connected to the kernel in CUDA Runtime API applications.
I know this is done under the hood via the Driver API, but I would like to understand the timing of the creation.
To begin with, I know that __cudaRegisterFatBinary is the first call into the CUDA runtime, and it registers the fatbin image with the runtime. It is followed by several function-registration calls that, I believe, lead to cuModuleLoad-style calls in the driver layer. But then, when my CUDA Runtime API application calls cudaMalloc, the pointer it returns belongs to a context which, I believe, must have been created in advance. How does the runtime get a handle to this already-created context and associate subsequent API calls with it? Please demystify the inner workings.
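To make the question concrete, here is a small probe I wrote (my own sketch, compiled with nvcc; the behavior I describe in the comments is what I expect, not something I am asserting from documentation). It mixes the Driver and Runtime APIs to check when a context first appears:

```c
#include <stdio.h>
#include <cuda.h>          /* Driver API */
#include <cuda_runtime.h>  /* Runtime API */

int main(void)
{
    CUcontext ctx = NULL;

    cuInit(0);                      /* initialize the Driver API only */
    cuCtxGetCurrent(&ctx);
    printf("before cudaMalloc: ctx = %p\n", (void *)ctx);  /* expect NULL */

    void *d = NULL;
    cudaMalloc(&d, 256);            /* first runtime call that needs a context */

    cuCtxGetCurrent(&ctx);
    printf("after  cudaMalloc: ctx = %p\n", (void *)ctx);  /* expect non-NULL */

    cudaFree(d);
    return 0;
}
```

If my understanding is right, the second printout shows a context that the runtime created implicitly somewhere between program start and the cudaMalloc call, and that is exactly the timing I want demystified.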
Quoting the NVIDIA documentation on this:
The CUDA Runtime API operates on the CUDA Driver API CUcontext that is current to the calling host thread. If no CUDA Driver API CUcontext is current to the calling host thread at the time of a CUDA Runtime API call which requires a CUcontext, then the CUDA Runtime will implicitly create a new CUcontext before executing the call.
If the CUDA Runtime creates a CUcontext, then the CUcontext will be created using the parameters specified by the CUDA Runtime API functions cudaSetDevice, cudaSetValidDevices, cudaSetDeviceFlags, cudaGLSetGLDevice, cudaD3D9SetDirect3DDevice, cudaD3D10SetDirect3DDevice, and cudaD3D11SetDirect3DDevice. Note that these functions will fail with cudaErrorSetOnActiveProcess if they are called when a CUcontext is current to the calling host thread.
The lifetime of a CUcontext is managed by a reference counting mechanism. The reference count of a CUcontext is initially set to 0, and is incremented by cuCtxAttach and decremented by cuCtxDetach.
If a CUcontext is created by the CUDA Runtime, then the CUDA Runtime will decrement the reference count of that CUcontext in the function cudaThreadExit. If a CUcontext is created by the CUDA Driver API (or is created by a separate instance of the CUDA Runtime API library), then the CUDA Runtime will not increment or decrement the reference count of that CUcontext.
All CUDA Runtime API state (e.g., global variables' addresses and values) travels with its underlying CUcontext. In particular, if a CUcontext is moved from one thread to another (using cuCtxPopCurrent and cuCtxPushCurrent), then all CUDA Runtime API state will move to that thread as well.
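I tried to exercise that last paragraph directly. The following sketch (my own, and it reflects my reading of the quoted text rather than a confirmed behavior) pops the context the runtime created off the main thread and pushes it in a worker thread, where runtime calls should then operate on the same context:

```c
#include <pthread.h>
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>

static void *worker(void *arg)
{
    CUcontext ctx = (CUcontext)arg;
    cuCtxPushCurrent(ctx);      /* adopt the context created by the runtime */

    void *d = NULL;
    cudaMalloc(&d, 256);        /* runtime call now uses that same context */
    printf("worker allocated %p in the migrated context\n", d);
    cudaFree(d);

    cuCtxPopCurrent(&ctx);
    return NULL;
}

int main(void)
{
    cudaFree(0);                /* force the runtime to create its context */

    CUcontext ctx = NULL;
    cuCtxPopCurrent(&ctx);      /* detach it from the main thread */

    pthread_t t;
    pthread_create(&t, NULL, worker, ctx);
    pthread_join(t, NULL);

    cuCtxPushCurrent(ctx);      /* restore it on the main thread */
    return 0;
}
```

This only deepens my question: the runtime clearly has a notion of "the" context behind cudaMalloc, but I cannot see where that association is established.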
But I do not understand how the CUDA runtime creates the context. What API calls are used for this? Does the nvcc compiler insert some calls to do this at compile time, or is it done entirely at runtime? If the former, which APIs does the inserted code use for context management? If the latter, how exactly is it done?
If a context is associated with a host thread, how do we access this context? Is it automatically associated with all variables and pointer references handled by that thread?
How does the module end up being loaded into the context?
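In case it helps frame the question, my current mental model is that on first use the runtime does roughly the equivalent of the following Driver API sequence. This is purely a hypothetical sketch: fatbin_image and "myKernel" are placeholders for what nvcc embeds and what __cudaRegisterFatBinary / the function-registration calls record, and the use of the primary context is my assumption.

```c
#include <cuda.h>

/* Placeholder for the fatbin image that nvcc embeds in the executable. */
extern const void *fatbin_image;

int launch(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&ctx, dev);  /* assumed: runtime uses the device's primary context */
    cuCtxSetCurrent(ctx);

    cuModuleLoadData(&mod, fatbin_image); /* load the registered fatbin into this context */
    cuModuleGetFunction(&fn, mod, "myKernel");

    /* 1 block of 32 threads, no shared memory, default stream, no arguments */
    cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, NULL, NULL, NULL);
    return 0;
}
```

Is this sequence roughly what happens under the hood, and if so, at which point (registration time, first API call, or kernel launch) does each step actually occur?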