Understanding CUDA Kernel Launch Options

I am trying to analyze some code that I found on the Internet, and I keep second-guessing myself. I am looking at a histogram kernel launched with the following parameters:

histogram<<<2500, numBins, numBins * sizeof(unsigned int)>>>(...); 

I know that the parameters are the grid size, block size, and shared memory size.

Does this mean that there are 2500 blocks of numBins threads each, and that each block also has numBins * sizeof(unsigned int) bytes of shared memory available to its threads?

In addition, there are __syncthreads() calls inside the kernel itself. Are there then 2500 sets of numBins calls to __syncthreads() during the kernel launch?

1 answer

Does this mean that there are 2500 blocks of numBins threads each, and that each block also has numBins * sizeof(unsigned int) bytes of shared memory available to its threads?

From the CUDA Toolkit Documentation:

The execution configuration (for a call to a __global__ function) is specified by inserting an expression of the form <<<Dg, Db, Ns, S>>> between the function name and the parenthesized argument list, where:

  • Dg (dim3) specifies the dimension and size of the grid.
  • Db (dim3) specifies the dimension and size of each block.
  • Ns (size_t) specifies the number of bytes in shared memory that is dynamically allocated per block for this call, in addition to the statically allocated memory.
  • S (cudaStream_t) specifies the associated stream; it is an optional parameter which defaults to 0.

So, as @Fazar pointed out, the answer is yes. This memory is allocated to each block.
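To make this concrete, here is a sketch of how a kernel sees that dynamic allocation (this is not the original poster's code; the parameter names are assumed). The third launch parameter sets the size, in bytes, of an extern __shared__ array inside the kernel, and each of the 2500 blocks gets its own private copy:

```cuda
// Sketch only: signature and names are assumptions, not the poster's code.
__global__ void histogram(const unsigned int *d_in, unsigned int *d_out,
                          int numBins)
{
    // Dynamically allocated shared memory: its size in bytes is the third
    // launch parameter (numBins * sizeof(unsigned int)). Every block gets
    // its own independent copy of this array.
    extern __shared__ unsigned int smem[];
    // ... kernel body ...
}

// Launch: 2500 blocks, numBins threads per block, and
// numBins * sizeof(unsigned int) bytes of shared memory per block.
histogram<<<2500, numBins, numBins * sizeof(unsigned int)>>>(d_in, d_out, numBins);
```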

In addition, there are __syncthreads() calls inside the kernel itself. Are there then 2500 sets of numBins calls to __syncthreads() during the kernel launch?

__syncthreads() waits until all threads in the thread block have reached this point. It is used to coordinate communication between the threads of a single block.

So the __syncthreads() barrier operates per block: each of the 2500 blocks synchronizes its own numBins threads independently, and blocks never synchronize with each other through it.
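As an illustration (a minimal sketch assuming one thread per bin, not the poster's actual kernel), a shared-memory histogram typically hits the barrier twice per block: once after zeroing the bins and once after accumulating, before merging the block-local counts into the global result. Each barrier coordinates only the numBins threads of one block; the 2500 blocks run independently.

```cuda
// Sketch only: assumes blockDim.x == numBins and d_out is zero-initialized.
__global__ void histogram(const unsigned int *d_in, unsigned int *d_out,
                          int numBins, int numElems)
{
    extern __shared__ unsigned int smem[];

    // One thread per bin zeroes its shared-memory counter.
    smem[threadIdx.x] = 0;
    __syncthreads();   // barrier #1: all bins zeroed before counting starts

    // Each block processes a grid-stride slice of the input.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numElems;
         i += gridDim.x * blockDim.x)
    {
        atomicAdd(&smem[d_in[i] % numBins], 1u);
    }
    __syncthreads();   // barrier #2: block-local histogram is complete

    // Merge this block's counts into the global histogram.
    atomicAdd(&d_out[threadIdx.x], smem[threadIdx.x]);
}
```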


Source: https://habr.com/ru/post/1206310/
