This does look wrong, but the first thing to check is register usage. The NVIDIA OpenCL compiler supports verbose output: pass the option to clBuildProgram and then check the build log. Something like this:
clBuildProgram(program, 1, &device_id, "-cl-nv-verbose", NULL, NULL);
This option is described in the cl_nv_compiler_options extension. Look up the maximum number of registers for your device in the CUDA documentation. What can happen is that the total number of registers required by a work-group exceeds what is available on one SM/SMX, which leads to an error.
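To make the register check concrete, here is a rough sketch in plain C of the arithmetic involved. The function names and the numbers in the usage note are hypothetical; take the real per-work-item register count from the verbose build log and the per-SM register file size from the CUDA documentation for your device. It also ignores register allocation granularity, so treat it as a first estimate only.

```c
/* Registers needed by one work-group: each work-item holds its own set.
 * regs_per_item is assumed to come from the verbose build log. */
unsigned long regs_per_group(unsigned regs_per_item, unsigned work_group_size)
{
    return (unsigned long)regs_per_item * work_group_size;
}

/* Returns 1 if a single work-group fits in the register file of one SM/SMX,
 * 0 otherwise. regs_per_sm is assumed to come from the CUDA documentation. */
int group_fits(unsigned regs_per_item, unsigned work_group_size,
               unsigned long regs_per_sm)
{
    return regs_per_group(regs_per_item, work_group_size) <= regs_per_sm;
}
```

For example, with an assumed 63 registers per work-item and an assumed register file of 32768 registers per SM, group_fits(63, 512, 32768) returns 1 (32256 registers needed), while group_fits(63, 1024, 32768) returns 0 (64512 needed). In the second case, reducing the work-group size or the kernel's register usage would be the thing to try.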
If register usage is not the problem, it could be an out-of-bounds memory access. I know the error message does not suggest this, but I have seen such errors before. A bug like that can be anywhere, and it is much harder to find.