This may be device specific; I'm speaking from experience with Intel GPUs. Program-scope resources are visible only to the kernels in that program. Beyond that, register allocation is done per kernel, so one kernel in each of K programs versus K kernels in one program makes no difference to register pressure. Each program is compiled and linked as a whole, so putting K kernels in one program is less efficient in terms of startup time if you do not end up using all K kernels.
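To make the startup-time argument concrete, here is a minimal sketch in plain C that only counts how many kernels the compiler would have to process under each layout. Everything in it is hypothetical: `build_program` is a stand-in for the real `clCreateProgramWithSource` + `clBuildProgram` sequence, `K` is an arbitrary kernel count, and the assumption that build cost scales with the number of kernels in a program is a simplification, not a measurement.

```c
#include <assert.h>

#define K 4  /* hypothetical number of kernels in the application */

/* Total number of kernels the (simulated) compiler has processed so far. */
static int kernels_compiled = 0;

/* Stand-in for clCreateProgramWithSource + clBuildProgram. It performs no
   real compilation; it only records how many kernels were in the program. */
void build_program(int kernels_in_program) {
    kernels_compiled += kernels_in_program;
}

/* Layout A: a single program containing all K kernels, built on first use. */
static int monolithic_built = 0;

void use_kernel_monolithic(int index) {
    assert(index >= 0 && index < K);
    if (!monolithic_built) {
        build_program(K);      /* pays for every kernel, used or not */
        monolithic_built = 1;
    }
}

/* Layout B: one program per kernel, each built lazily on first use. */
static int split_built[K];

void use_kernel_split(int index) {
    assert(index >= 0 && index < K);
    if (!split_built[index]) {
        build_program(1);      /* pays only for the kernel actually used */
        split_built[index] = 1;
    }
}

/* Helpers for inspecting and resetting the simulated build state. */
int compiled_so_far(void) {
    return kernels_compiled;
}

void reset_counters(void) {
    kernels_compiled = 0;
    monolithic_built = 0;
    for (int i = 0; i < K; ++i) {
        split_built[i] = 0;
    }
}
```

Under these assumptions, calling `use_kernel_monolithic(0)` once compiles all `K` kernels, while calling `use_kernel_split(0)` compiles only one. If the application ends up using every kernel, both layouts process the same `K` kernels in total, so the split layout only helps when some kernels go unused.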