I want to understand how the CUDA context is created and connected to the kernel in CUDA Runtime API applications.
I know this is done under the hood via the Driver API, but I would like to understand the timing of the creation.
To begin with, I know that __cudaRegisterFatBinary is the first call into the CUDA runtime, and it registers the fatbin image with the runtime. It is followed by several function-registration calls that, I believe, lead to cuModuleLoad-style calls in the driver layer. But then, when my CUDA Runtime API application calls cudaMalloc, the pointer it returns belongs to a context which, I believe, must have been created in advance. How does the runtime get a handle to this already-created context and associate subsequent API calls with it? Please demystify the inner workings.
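To make the question concrete, here is a small probe I wrote (my own sketch, compiled with nvcc; the behavior I describe in the comments is what I expect, not something I am asserting from documentation). It mixes the Driver and Runtime APIs to check when a context first appears:

```c
#include <stdio.h>
#include <cuda.h>          /* Driver API */
#include <cuda_runtime.h>  /* Runtime API */

int main(void)
{
    CUcontext ctx = NULL;

    cuInit(0);                      /* initialize the Driver API only */
    cuCtxGetCurrent(&ctx);
    printf("before cudaMalloc: ctx = %p\n", (void *)ctx);  /* expect NULL */

    void *d = NULL;
    cudaMalloc(&d, 256);            /* first runtime call that needs a context */

    cuCtxGetCurrent(&ctx);
    printf("after  cudaMalloc: ctx = %p\n", (void *)ctx);  /* expect non-NULL */

    cudaFree(d);
    return 0;
}
```

If my understanding is right, the second printout shows a context that the runtime created implicitly somewhere between program start and the cudaMalloc call, and that is exactly the timing I want demystified.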
Quoting the NVIDIA documentation on this:
The CUDA Runtime API operates on the CUDA Driver API CUcontext that is current to the calling host thread. If no CUDA Driver API CUcontext is current to the calling host thread at the time of a CUDA Runtime API call which requires a CUcontext, then the CUDA Runtime will implicitly create a new CUcontext before executing the call.
If the CUDA Runtime creates a CUcontext, then the CUcontext will be created using the parameters specified by the CUDA Runtime API functions cudaSetDevice, cudaSetValidDevices, cudaSetDeviceFlags, cudaGLSetGLDevice, cudaD3D9SetDirect3DDevice, cudaD3D10SetDirect3DDevice, and cudaD3D11SetDirect3DDevice. Note that these functions will fail with cudaErrorSetOnActiveProcess if they are called when a CUcontext is current to the calling host thread.
The lifetime of a CUcontext is managed by a reference counting mechanism. The reference count of a CUcontext is initially set to 0, and is incremented by cuCtxAttach and decremented by cuCtxDetach.
If a CUcontext is created by the CUDA Runtime, then the CUDA Runtime will decrement the reference count of that CUcontext in the function cudaThreadExit. If a CUcontext is created by the CUDA Driver API (or is created by a separate instance of the CUDA Runtime API library), then the CUDA Runtime will not increment or decrement the reference count of that CUcontext.
All CUDA Runtime API state (e.g., global variables' addresses and values) travels with its underlying CUcontext. In particular, if a CUcontext is moved from one thread to another (using cuCtxPopCurrent and cuCtxPushCurrent), then all CUDA Runtime API state will move to that thread as well.
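I tried to exercise that last paragraph directly. The following sketch (my own, and it reflects my reading of the quoted text rather than a confirmed behavior) pops the context the runtime created off the main thread and pushes it in a worker thread, where runtime calls should then operate on the same context:

```c
#include <pthread.h>
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>

static void *worker(void *arg)
{
    CUcontext ctx = (CUcontext)arg;
    cuCtxPushCurrent(ctx);      /* adopt the context created by the runtime */

    void *d = NULL;
    cudaMalloc(&d, 256);        /* runtime call now uses that same context */
    printf("worker allocated %p in the migrated context\n", d);
    cudaFree(d);

    cuCtxPopCurrent(&ctx);
    return NULL;
}

int main(void)
{
    cudaFree(0);                /* force the runtime to create its context */

    CUcontext ctx = NULL;
    cuCtxPopCurrent(&ctx);      /* detach it from the main thread */

    pthread_t t;
    pthread_create(&t, NULL, worker, ctx);
    pthread_join(t, NULL);

    cuCtxPushCurrent(ctx);      /* restore it on the main thread */
    return 0;
}
```

This only deepens my question: the runtime clearly has a notion of "the" context behind cudaMalloc, but I cannot see where that association is established.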
But I do not understand how the CUDA runtime creates the context. What API calls are used for this? Does the nvcc compiler insert some calls to do this at compile time, or is it done entirely at runtime? If the former, which APIs does the inserted code use for context management? If the latter, how exactly is it done?
If a context is associated with a host thread, how do we access this context? Is it automatically associated with all variables and pointer references handled by that thread?
How does the module end up being loaded into the context?
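In case it helps frame the question, my current mental model is that on first use the runtime does roughly the equivalent of the following Driver API sequence. This is purely a hypothetical sketch: fatbin_image and "myKernel" are placeholders for what nvcc embeds and what __cudaRegisterFatBinary / the function-registration calls record, and the use of the primary context is my assumption.

```c
#include <cuda.h>

/* Placeholder for the fatbin image that nvcc embeds in the executable. */
extern const void *fatbin_image;

int launch(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&ctx, dev);  /* assumed: runtime uses the device's primary context */
    cuCtxSetCurrent(ctx);

    cuModuleLoadData(&mod, fatbin_image); /* load the registered fatbin into this context */
    cuModuleGetFunction(&fn, mod, "myKernel");

    /* 1 block of 32 threads, no shared memory, default stream, no arguments */
    cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, NULL, NULL, NULL);
    return 0;
}
```

Is this sequence roughly what happens under the hood, and if so, at which point (registration time, first API call, or kernel launch) does each step actually occur?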