I think you need to take a deeper look at kernel launch configuration and scheduling in CUDA.
There are two important launch sizes: the number of blocks and the number of threads per block.
Each block is scheduled onto one SM and is divided into warps. A block's shared memory is accessible only from within that block, because it resides in the SM's on-chip memory. The number of blocks resident on each SM depends on the device limits and the occupancy calculation; the maximum number of blocks per SM is 8 for CC 1.0-2.x and 16 for CC 3.x.
Each block launches a certain number of threads per block. The threads are divided into warps, and warps can execute in an arbitrary order determined only by the warp scheduler and the SM.
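To make this concrete, here is a minimal sketch of a kernel that uses per-block shared memory; the kernel name blockSum and the block size of 256 are invented for illustration:

```
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Hypothetical kernel: each block of 256 threads is split by its SM into
// 8 warps of 32 threads, and buf is visible only to threads of this block.
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float buf[BLOCK_SIZE];   // per-block memory on the SM

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // synchronize all warps of this block

    // Tree reduction within the block; other blocks cannot see buf.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockResults[blockIdx.x] = buf[0];
}

// Launch: one result per block.
// blockSum<<<numBlocks, BLOCK_SIZE>>>(d_in, d_blockResults, n);
```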
Your card has 384 cores in total: 2 SMs with 192 cores each. The CUDA core count is just the total number of single-precision floating-point or integer instructions that can be executed per cycle; leave CUDA cores out of all of these calculations.
The maximum number of threads per block depends on the compute capability. CC 2.0-3.x supports a maximum of 1024 threads per block, given sufficient registers and warp slots. Warps are statically assigned to warp schedulers; the number of warp schedulers per SM is 1 for CC 1.x, 2 for CC 2.x, and 4 for CC 3.x.
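You do not have to memorize these limits; the runtime reports them through the standard cudaGetDeviceProperties call. A minimal sketch:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("SM count          : %d\n", prop.multiProcessorCount);
    printf("Warp size         : %d\n", prop.warpSize);
    printf("Max threads/block : %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```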
If your application does not run concurrent kernels, then to make use of every SM, gridDim must be >= the number of SMs.
For the GTX 650M to use its full compute power, you must launch at least two blocks (otherwise a single block can occupy only one SM). Beyond that, if you want to schedule 10240 threads, you can simply launch 10 blocks of 1024 threads each.
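For the 10240-thread case, the launch configuration could be computed like this (myKernel is a placeholder name):

```
int n = 10240;
int threadsPerBlock = 1024;  // maximum for CC 2.0 and later
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = 10

// 10 blocks are distributed across the GTX 650M's 2 SMs, so both stay busy.
myKernel<<<blocks, threadsPerBlock>>>(/* args */);
```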