According to the CUDA programming guide, you can disable asynchronous kernel launches at run time by setting an environment variable (CUDA_LAUNCH_BLOCKING=1).
This is a useful debugging tool. I also want to use it to measure how much my code benefits from concurrent kernels and transfers.
To do that, I also want to disable other asynchronous calls, in particular cudaMemcpyAsync.
Does CUDA_LAUNCH_BLOCKING affect these kinds of calls in addition to kernel launches? I suspect not. What would be the best alternative? I could add cudaStreamSynchronize calls, but I would prefer a run-time solution. I could run under a debugger, but that would affect the timing and defeat the purpose.
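For reference, the manual cudaStreamSynchronize workaround I have in mind would look roughly like the sketch below. The wrapper name `memcpyAsyncMaybeBlocking` and the environment variable `SYNC_COPIES` are made up for illustration; they are not part of CUDA. Only `cudaMemcpyAsync`, `cudaStreamSynchronize`, and the allocation and stream calls are real runtime API functions.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical switch (not a CUDA feature): true when SYNC_COPIES=1 is set.
static bool syncCopiesRequested()
{
    static const bool enabled = [] {
        const char* v = std::getenv("SYNC_COPIES");
        return v != nullptr && std::strcmp(v, "1") == 0;
    }();
    return enabled;
}

// Hypothetical wrapper: issues the async copy, then blocks the host on the
// stream only when the switch above is enabled.
static cudaError_t memcpyAsyncMaybeBlocking(void* dst, const void* src,
                                            size_t count, cudaMemcpyKind kind,
                                            cudaStream_t stream)
{
    cudaError_t err = cudaMemcpyAsync(dst, src, count, kind, stream);
    if (err == cudaSuccess && syncCopiesRequested()) {
        err = cudaStreamSynchronize(stream);
    }
    return err;
}

int main()
{
    const size_t bytes = (1 << 20) * sizeof(float);
    float* host = nullptr;
    float* dev = nullptr;
    cudaStream_t stream;

    // Pinned host memory, so the copy can actually run asynchronously.
    if (cudaMallocHost((void**)&host, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess) return 1;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;
    std::memset(host, 0, bytes);

    cudaError_t err = memcpyAsyncMaybeBlocking(dev, host, bytes,
                                               cudaMemcpyHostToDevice, stream);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaStreamSynchronize(stream);  // drain the stream before cleanup
    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```

This can be toggled at run time without recompiling, but it means routing every copy through the wrapper, and it only blocks the host until that one stream has drained. That is why I am hoping for something built in.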