Using different streams for CUDA kernels makes concurrent kernel execution possible. Therefore, n kernels on n streams could theoretically run concurrently if they fit into the hardware, right?
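For illustration, a minimal sketch of this setup (dummyKernel and the buffer sizes are placeholders, not my real code):

    #include <cuda_runtime.h>

    __global__ void dummyKernel(float *data)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] += 1.0f; // trivial stand-in for the real work
    }

    int main()
    {
        const int n = 2;   // number of streams / independent kernels
        const int N = 256; // elements per kernel
        cudaStream_t streams[n];
        float *buf[n];

        for (int i = 0; i < n; ++i) {
            cudaStreamCreate(&streams[i]);
            cudaMalloc(&buf[i], N * sizeof(float));
            // Kernels in different streams may run concurrently;
            // kernels in the same stream run in issue order.
            dummyKernel<<<1, N, 0, streams[i]>>>(buf[i]);
        }

        cudaDeviceSynchronize();
        for (int i = 0; i < n; ++i) {
            cudaFree(buf[i]);
            cudaStreamDestroy(streams[i]);
        }
        return 0;
    }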
Now I am faced with the following problem: there are not n distinct kernels, but n*m, where the m kernels must be executed in order. For example, n=2 and m=3 would lead to the following execution scheme:
    Stream 1: <<<Kernel 0.1>>> <<<Kernel 1.1>>> <<<Kernel 2.1>>>
    Stream 2: <<<Kernel 0.2>>> <<<Kernel 1.2>>> <<<Kernel 2.2>>>
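In code, the in-order requirement maps directly onto CUDA's per-stream ordering guarantee; roughly like this (kernelStep is a hypothetical stand-in for the real kernels):

    #include <cuda_runtime.h>

    // Hypothetical stand-in for the real per-step kernels.
    __global__ void kernelStep(float *data, int step)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] += (float)step;
    }

    // Within a stream, kernels execute in issue order, so "m kernels in
    // order" falls out for free; across streams they may overlap.
    void enqueueDepthFirst(int n, int m, cudaStream_t *streams,
                           float **buf, dim3 grid, dim3 block)
    {
        for (int s = 0; s < n; ++s)     // Stream 1 .. n
            for (int k = 0; k < m; ++k) // Kernel 0.s .. (m-1).s, in order
                kernelStep<<<grid, block, 0, streams[s]>>>(buf[s], k);
    }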
My naive assumption is that the kernels x.1 and y.2 should be executed concurrently (from a theoretical point of view), or at least not sequentially (from a practical point of view). But my measurements show that this is not the case and that execution is effectively sequential (i.e. K0.1, K1.1, K2.1, K0.2, K1.2, K2.2). The kernels themselves are very small, so concurrent execution should not be a problem.
Now my approach would be to do some manual dispatching to ensure that the kernels are enqueued in an interleaved order into the scheduler on the GPU. But when dealing with a large number of streams/kernels, this could do more harm than good.
Well, straight to the point: what would be a suitable (or at least different) approach to solving this problem?
Edit: Measurements are performed using CUDA events. I measure the time required to complete the computation entirely, i.e. the GPU must compute all n*m kernels. The assumption is: with fully concurrent kernel execution, the total runtime is (ideally) roughly 1/n times the time needed to execute all kernels in order, which requires that two or more kernels can be executed concurrently. I ensure this by using only two distinct streams right now.
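For concreteness, the timing looks roughly like this (enqueueAll is a hypothetical callback standing in for the n*m launches):

    #include <cuda_runtime.h>

    // Returns the elapsed milliseconds for everything issued by enqueueAll.
    // Events recorded on the default stream bracket the whole batch, so the
    // measurement covers all n*m kernels regardless of their streams.
    float timeFullComputation(void (*enqueueAll)())
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);  // before the first launch
        enqueueAll();               // all n*m kernel launches
        cudaEventRecord(stop, 0);   // after the last launch
        cudaEventSynchronize(stop); // wait for the GPU to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }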
I can measure a clear difference in runtime between using the streams as described and enqueueing the kernels in an interleaved order, i.e.:
    Loop: i = 0 to m
        EnqueueKernel(Kernel i.1, Stream 1)
        EnqueueKernel(Kernel i.2, Stream 2)
versus
    Loop: i = 1 to n
        Loop: j = 0 to m
            EnqueueKernel(Kernel j.i, Stream i)
The latter leads to longer execution times.
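In CUDA, the interleaved order corresponds to something like the following (reusing the hypothetical kernelStep, streams, and buffers from the sketch above; the depth-first order is the enqueueDepthFirst function sketched earlier):

    // Interleaved (breadth-first) issue: consecutive launches target
    // different streams, so the scheduler always sees independent work
    // back to back.
    void enqueueInterleaved(int n, int m, cudaStream_t *streams,
                            float **buf, dim3 grid, dim3 block)
    {
        for (int k = 0; k < m; ++k)     // step k
            for (int s = 0; s < n; ++s) // Kernel k.(s+1) into Stream s+1
                kernelStep<<<grid, block, 0, streams[s]>>>(buf[s], k);
    }

The only difference between the two variants is the order in which the same n*m launches reach the GPU, which is exactly what the measured runtime gap distinguishes.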
Edit #2: Changed the stream numbering to start with 1 (instead of 0; see comments below).
Edit #3: Hardware is an NVIDIA Tesla M2090 (i.e. Fermi, compute capability 2.0).