Concurrent execution of CUDA kernels with multiple kernels per stream

Using different CUDA streams for the kernels makes concurrent kernel execution possible. So n kernels launched into n streams can, in theory, start concurrently if they fit on the hardware, right?

Now I am faced with the following problem: there are not n different kernels, but n * m, where the m kernels of each stream have to run in order. For example, n = 2 and m = 3 lead to the following flowchart:

    Stream 1: <<<Kernel 0.1>>> <<<Kernel 1.1>>> <<<Kernel 2.1>>>
    Stream 2: <<<Kernel 0.2>>> <<<Kernel 1.2>>> <<<Kernel 2.2>>>
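For concreteness, here is a minimal sketch of this setup in CUDA. The kernel body, buffer size, and launch configuration (stepKernel, buf, 256 threads) are placeholders I made up, not from the original code:

    #include <cuda_runtime.h>

    // Placeholder kernel: each step depends on the previous one, which is
    // why the m launches within a stream must stay in order.
    __global__ void stepKernel(float *data, int step)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] += (float)step;
    }

    int main()
    {
        const int n = 2;  // number of streams
        const int m = 3;  // kernels per stream, to run in order
        cudaStream_t streams[n];
        float *buf[n];

        for (int i = 0; i < n; ++i) {
            cudaStreamCreate(&streams[i]);
            cudaMalloc(&buf[i], 256 * sizeof(float));
        }

        // Launches into the same stream execute in issue order; launches
        // into different streams may overlap if resources allow.
        for (int j = 0; j < m; ++j)
            for (int i = 0; i < n; ++i)
                stepKernel<<<1, 256, 0, streams[i]>>>(buf[i], j);

        cudaDeviceSynchronize();
        for (int i = 0; i < n; ++i) {
            cudaFree(buf[i]);
            cudaStreamDestroy(streams[i]);
        }
        return 0;
    }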

My naive assumption was that the kernels x.1 and y.2 would execute concurrently (theoretically) or at least not sequentially (practically). But my measurements show that this is not the case: execution appears to be sequential (i.e. K0.1, K1.1, K2.1, K0.2, K1.2, K2.2). The kernels themselves are very small, so concurrent execution should not be a problem.

My next approach would be to do some manual scheduling to ensure that the kernels are enqueued alternately into the GPU's scheduler. But with a large number of streams/kernels, this could do more harm than good.

So, straight to the point: what would be a suitable (or at least a different) approach to this situation?

Edit: The measurements are done with CUDA events. I measure the time needed for the complete computation, i.e. the GPU has to compute all n * m kernels. The assumption is: with fully concurrent kernel execution, the total runtime is ideally about 1/n of the time needed to execute all kernels in order, which implies that two or more kernels must be able to run concurrently. Right now I ensure this by using only two distinct streams.
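For reference, the timing pattern described here looks roughly like this; launchAllKernels is a hypothetical stand-in for the launch loop, not a real API:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);        // default stream, before any launch
    launchAllKernels(streams, n, m);  // hypothetical: enqueues all n*m kernels
    cudaEventRecord(stop, 0);         // default stream, after all launches
    cudaEventSynchronize(stop);       // wait until the GPU has finished everything

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // With ideal concurrency over n streams, ms approaches the sequential
    // runtime divided by n.

    cudaEventDestroy(start);
    cudaEventDestroy(stop);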

I can measure a clear difference in runtime between dispatching the kernels alternately across the streams and enqueuing them stream by stream, i.e.:

    Loop: i = 0 to m
        EnqueueKernel(Kernel i.1, Stream 1)
        EnqueueKernel(Kernel i.2, Stream 2)

against

    Loop: i = 1 to n
        Loop: j = 0 to m
            EnqueueKernel(Kernel j.i, Stream i)

The latter leads to longer execution times.
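In CUDA, reusing the placeholder names (stepKernel, streams, buf) from the sketch above, the two enqueue orders would look like this:

    // Alternating ("breadth-first") order -- the faster variant:
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < n; ++i)
            stepKernel<<<1, 256, 0, streams[i]>>>(buf[i], j);

    // Stream-by-stream ("depth-first") order -- the slower variant:
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            stepKernel<<<1, 256, 0, streams[i]>>>(buf[i], j);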

Edit #2: Changed the stream numbering to start at 1 (instead of 0, see comments below).

Edit #3: The hardware is an NVIDIA Tesla M2090 (i.e. Fermi, Compute Capability 2.0).

1 answer

On a Fermi device (aka Compute Capability 2.0), it is best to interleave the kernel launches across the streams rather than launching all kernels into one stream, then all into the next stream, and so on. The reason is that the hardware can launch kernels into different streams immediately if there are sufficient resources, whereas consecutive launches into the same stream often introduce a delay that reduces concurrency. This is why your first approach performs better, and it is the approach you should choose.

Enabling profiling can also disable concurrency on Fermi, so be careful with that. Also be careful about recording CUDA events inside the launch loop, as these can interfere; timing the entire loop with events, as you do, is fine.
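To illustrate the event caveat, a sketch of the pattern to avoid; perLaunch is a hypothetical array of per-launch events, and the other names are the placeholders from the question's sketches:

    // Risky on Fermi: recording an event after every launch can inhibit
    // concurrency between the streams.
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < n; ++i) {
            stepKernel<<<1, 256, 0, streams[i]>>>(buf[i], j);
            cudaEventRecord(perLaunch[i][j], streams[i]);  // may serialize
        }

    // Safer: a single event pair around the whole launch loop, as in the
    // timing sketch in the question.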


Source: https://habr.com/ru/post/908613/

