I think you need to take a deeper look at kernel launch configuration and scheduling in CUDA.
There are two important launch sizes: the number of blocks and the number of threads per block.
Each block is scheduled onto one SM and is divided into warps. A block's shared memory is accessible only from within that block, because it resides in the SM's on-chip memory. The number of blocks resident on each SM depends on the device limits and the occupancy calculation; the maximum number of blocks per SM is 8 for CC 1.0-2.x and 16 for CC 3.x.
Each block launches a certain number of threads per block. The threads are divided into warps, and warps can execute in an arbitrary order determined only by the warp scheduler and the SM.
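To make this concrete, here is a minimal sketch of a kernel that uses per-block shared memory; the kernel name blockSum and the block size of 256 are invented for illustration:

```
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Hypothetical kernel: each block of 256 threads is split by its SM into
// 8 warps of 32 threads, and buf is visible only to threads of this block.
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float buf[BLOCK_SIZE];   // per-block memory on the SM

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // synchronize all warps of this block

    // Tree reduction within the block; other blocks cannot see buf.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockResults[blockIdx.x] = buf[0];
}

// Launch: one result per block.
// blockSum<<<numBlocks, BLOCK_SIZE>>>(d_in, d_blockResults, n);
```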
Your card has 384 cores in total: 2 SMs with 192 cores each. The CUDA core count is just the total number of single-precision floating-point or integer instructions that can be executed per cycle; leave CUDA cores out of all of these calculations.
The maximum number of threads per block depends on the compute capability. CC 2.0-3.x supports a maximum of 1024 threads per block, given sufficient registers and warp slots. Warps are statically assigned to warp schedulers; the number of warp schedulers per SM is 1 for CC 1.x, 2 for CC 2.x, and 4 for CC 3.x.
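You do not have to memorize these limits; the runtime reports them through the standard cudaGetDeviceProperties call. A minimal sketch:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("SM count          : %d\n", prop.multiProcessorCount);
    printf("Warp size         : %d\n", prop.warpSize);
    printf("Max threads/block : %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```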
If your application does not run concurrent kernels, then to make use of every SM, gridDim must be >= the number of SMs.
For the GTX 650M to use its full compute power, you must launch at least two blocks (otherwise a single block can occupy only one SM). Beyond that, if you want to schedule 10240 threads, you can simply launch 10 blocks of 1024 threads each.
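For the 10240-thread case, the launch configuration could be computed like this (myKernel is a placeholder name):

```
int n = 10240;
int threadsPerBlock = 1024;  // maximum for CC 2.0 and later
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = 10

// 10 blocks are distributed across the GTX 650M's 2 SMs, so both stay busy.
myKernel<<<blocks, threadsPerBlock>>>(/* args */);
```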