Nested kernels in CUDA

Question

CUDA does not currently allow nested kernels.

To be specific, I have the following problem: I have N data points, each M-dimensional. To process each of the N data points, three kernels must be run in sequence. Since kernel nesting is not allowed, I cannot create one kernel that calls the three kernels. Therefore, I have to process the data points one at a time.

One solution is to write a single large kernel containing the functionality of all three kernels, but I think that would not be optimal.
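For reference, the single-kernel approach could be sketched roughly as follows. This is only an illustration under assumptions: the stage functions `stage_1`, `stage_2`, `stage_3`, the kernel `fused_kernel`, and the host helper `process_points` are hypothetical names, and the arithmetic in each stage is a placeholder. It also assumes that each stage only touches its own data point, so no synchronization across the grid is needed between stages.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-point stages; each works in place on one M-dimensional point.
__device__ void stage_1(float *p, int m) { for (int j = 0; j < m; ++j) p[j] += 1.0f; }
__device__ void stage_2(float *p, int m) { for (int j = 0; j < m; ++j) p[j] *= 2.0f; }
__device__ void stage_3(float *p, int m) { for (int j = 0; j < m; ++j) p[j] -= 3.0f; }

// One thread per data point: the three stages run back to back for point i.
__global__ void fused_kernel(float *data, int n, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float *p = data + (size_t)i * m;
        stage_1(p, m);
        stage_2(p, m);
        stage_3(p, m);
    }
}

// Host helper: processes n points of dimension m stored contiguously in host_data.
cudaError_t process_points(float *host_data, int n, int m)
{
    float *dev = NULL;
    size_t bytes = (size_t)n * m * sizeof(float);
    cudaError_t err = cudaMalloc((void **)&dev, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host_data, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        fused_kernel<<<blocks, threads>>>(dev, n, m);
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(host_data, dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err;
}

int main()
{
    float data[4 * 3] = {0};  // 4 points, 3 dimensions, all zero
    cudaError_t err = process_points(data, 4, 3);
    assert(err == cudaSuccess);
    // Each element goes through (x + 1) * 2 - 3.
    for (int k = 0; k < 12; ++k) assert(data[k] == -1.0f);
    printf("fused pipeline ok\n");
    return 0;
}
```

The three stages stay separate as `__device__` functions, so the code keeps its structure even though only one kernel is launched.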

Can anyone suggest how to process the N data points in parallel while keeping the three smaller kernels?

Thanks.

arrays cuda

Prasanna Dec 12 '10 at 3:57

jmilloy · Answer 1 · 2010-12-14T14:34:00+0000

If you want to run the kernels concurrently on the N data points, you can use streams. Create N streams:

cudaStream_t *streams;   /* array of N stream handles */
int i;
streams = (cudaStream_t *)malloc(N * sizeof(cudaStream_t));
for (i = 0; i < N; i++)
{
    cudaStreamCreate(&streams[i]);
}

Copy the data for the i-th data point with cudaMemcpyAsync on the i-th stream:

cudaMemcpyAsync(dst, src, count, kind, streams[i]);

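Then launch the kernels for the i-th data point on the same stream, using all four launch configuration parameters (sharedMemory can be 0 if the kernels use no dynamic shared memory):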

kernel_1 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );
kernel_2 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );

Finally, when all the work is done, clean up the streams:

for(i=0; i<N; i++)
{
    cudaStreamDestroy(streams[i]);
}
free(streams);
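Putting the steps above together, a minimal end-to-end sketch might look like this. The kernel bodies, `kernel_3`, and the helper `run_pipeline` are placeholders I made up for illustration; the question does not say what the real kernels compute. The sketch assumes the N data points are stored contiguously, one M-element row per point.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder kernels: each thread handles one of the m elements of a point.
__global__ void kernel_1(float *p, int m)
{ int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < m) p[j] += 1.0f; }
__global__ void kernel_2(float *p, int m)
{ int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < m) p[j] *= 2.0f; }
__global__ void kernel_3(float *p, int m)
{ int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < m) p[j] -= 3.0f; }

// Runs the three kernels in order for each of n points, one stream per point.
// Returns 0 on success, -1 on failure.
int run_pipeline(float *host, int n, int m)
{
    size_t bytes = (size_t)m * sizeof(float);
    float *dev = NULL;
    if (cudaMalloc((void **)&dev, n * bytes) != cudaSuccess) return -1;

    cudaStream_t *streams = (cudaStream_t *)malloc(n * sizeof(cudaStream_t));
    if (!streams) { cudaFree(dev); return -1; }
    for (int i = 0; i < n; i++) cudaStreamCreate(&streams[i]);

    int threads = 128;
    int blocks = (m + threads - 1) / threads;
    for (int i = 0; i < n; i++) {
        float *d = dev + (size_t)i * m;
        float *h = host + (size_t)i * m;
        // Work queued on one stream executes in issue order.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[i]);
        kernel_1<<<blocks, threads, 0, streams[i]>>>(d, m);
        kernel_2<<<blocks, threads, 0, streams[i]>>>(d, m);
        kernel_3<<<blocks, threads, 0, streams[i]>>>(d, m);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[i]);
    }

    // Wait for all streams before reading results or tearing down.
    cudaError_t err = cudaDeviceSynchronize();

    for (int i = 0; i < n; i++) cudaStreamDestroy(streams[i]);
    free(streams);
    cudaFree(dev);
    return err == cudaSuccess ? 0 : -1;
}

int main()
{
    const int n = 4, m = 8;
    float host[n * m];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) host[i * m + j] = (float)i;

    int rc = run_pipeline(host, n, m);
    assert(rc == 0);
    // Each element of point i becomes (i + 1) * 2 - 3.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            assert(host[i * m + j] == (float)(2 * i - 1));
    printf("stream pipeline ok\n");
    return 0;
}
```

Kernels in the same stream run in order, while kernels in different streams may overlap if the device supports it. The host buffer here is ordinary pageable memory, which is fine for correctness; for the copies to actually overlap with computation, the host buffer would need to be pinned (for example, allocated with `cudaMallocHost`).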

Tae-Sung Shin · Answer 2 · 2014-10-02T15:59:26+0000

Nowadays, with NVIDIA devices of Compute Capability 3.5 or higher and Dynamic Parallelism, nested kernels are possible.
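As a rough sketch of what that could look like for the question's case, a parent kernel can launch the three child kernels for each data point. The kernel bodies and the names `parent_kernel` and `run_nested` are hypothetical. I am assuming a device of compute capability 3.5 or higher and a build with relocatable device code, something like `nvcc -rdc=true -arch=sm_XX file.cu -lcudadevrt` with `sm_XX` set to the target architecture.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder child kernels: each thread handles one of the m elements.
__global__ void kernel_1(float *p, int m)
{ int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < m) p[j] += 1.0f; }
__global__ void kernel_2(float *p, int m)
{ int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < m) p[j] *= 2.0f; }
__global__ void kernel_3(float *p, int m)
{ int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < m) p[j] -= 3.0f; }

// Parent kernel: thread i launches the three child kernels for data point i.
__global__ void parent_kernel(float *data, int n, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float *p = data + (size_t)i * m;
    int threads = 128;
    int blocks = (m + threads - 1) / threads;

    // Device-side streams must be created with cudaStreamNonBlocking.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    // Launches into the same stream run in order.
    kernel_1<<<blocks, threads, 0, s>>>(p, m);
    kernel_2<<<blocks, threads, 0, s>>>(p, m);
    kernel_3<<<blocks, threads, 0, s>>>(p, m);
    cudaStreamDestroy(s);
}

// Host helper: zero-fills n*m floats on the device, runs the nested pipeline,
// and copies the result into host. Returns 0 on success, -1 on failure.
int run_nested(float *host, int n, int m)
{
    size_t bytes = (size_t)n * m * sizeof(float);
    float *dev = NULL;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return -1;
    cudaMemset(dev, 0, bytes);

    int threads = 64;
    int blocks = (n + threads - 1) / threads;
    parent_kernel<<<blocks, threads>>>(dev, n, m);

    // The parent grid completes only after all of its child grids finish.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess ? 0 : -1;
}

int main()
{
    float host[4 * 8];
    int rc = run_nested(host, 4, 8);
    assert(rc == 0);
    // Starting from zero, each element becomes (0 + 1) * 2 - 3.
    for (int k = 0; k < 32; ++k) assert(host[k] == -1.0f);
    printf("nested pipeline ok\n");
    return 0;
}
```

Giving each parent thread its own stream lets the child kernels of different data points run concurrently, while the three kernels for a single point still execute in sequence.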

Dimitri · Answer 3 · 2010-12-24T09:13:49+0000

With Fermi (compute capability 2.x) or later, you can now run kernels concurrently.
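Whether a given device can do this can be checked at run time through the `concurrentKernels` field of `cudaDeviceProp`. A small sketch, where the helper name `supports_concurrent_kernels` is my own:

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if the device can run several kernels at once,
// 0 if it cannot, and -1 if the query fails.
int supports_concurrent_kernels(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.concurrentKernels ? 1 : 0;
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA device found\n");
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaError_t err = cudaGetDeviceProperties(&prop, d);
        assert(err == cudaSuccess);
        printf("device %d: %s, compute capability %d.%d, concurrent kernels: %s\n",
               d, prop.name, prop.major, prop.minor,
               supports_concurrent_kernels(d) == 1 ? "yes" : "no");
    }
    return 0;
}
```

Even when the flag is set, kernels only overlap if they are launched into different non-default streams and the device has resources left over for more than one of them.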
