CUDA - limit the number of SMs used

Is there a way to explicitly limit the number of GPU multiprocessors (SMs) used during the execution of my program? I would like to measure how my algorithm scales as the number of multiprocessors increases.

If it helps: I am using CUDA 4.0 and a compute capability 2.0 device.

1 answer

Ah, I know this problem. I played with it myself when writing a paper.

There is no explicit way to do this; however, you can try to hack around it by making some of the blocks do nothing.

  • If you never launch more blocks than there are multiprocessors, your job is simple: just launch even fewer blocks. Some of the SMs are guaranteed to have no work, because a block cannot be split across several SMs.
  • If you launch many more blocks and simply rely on the driver to schedule them, use a different approach: launch only as many blocks as your GPU can process at once, and when a block finishes its piece of work, instead of terminating, loop back to the beginning and fetch another piece of data to work on (see the sketch after this list). Most likely, the performance of your program will not drop; it can even improve if you schedule your work carefully :)
  • The hardest case is when all of your blocks run on the GPU at the same time, but you have more than one block per SM. Then you need to launch normally, but manually "disable" some of the blocks and make the other blocks do their work for them. The problem is deciding which blocks to disable so as to guarantee that one SM is working and another is not (a sketch that queries the SM ID directly appears after my scheduling notes below).
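As an illustration of the second bullet, here is a minimal sketch of such a "persistent blocks" work loop. This is my own scaffolding, not code from the original answer: `workCounter`, `persistentKernel`, and the doubling "work" are placeholder names, and the pattern assumes you launch no more blocks than the GPU can keep resident at once.

```cuda
// Hypothetical work-pool counter; must be reset to 0 before each launch,
// e.g. with cudaMemcpyToSymbol from the host.
__device__ int workCounter;

__global__ void persistentKernel(const float *in, float *out, int n)
{
    __shared__ int chunk;

    for (;;) {
        if (threadIdx.x == 0)
            chunk = atomicAdd(&workCounter, 1);  // grab the next chunk of data
        __syncthreads();

        int base = chunk * blockDim.x;
        if (base >= n)
            return;                              // no work left: now the block terminates

        int i = base + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];               // stand-in for the real work
        __syncthreads();                         // keep `chunk` stable until all threads are done
    }
}

// Host side (sketch): reset the counter, then launch only as many blocks
// as the GPU can keep resident, e.g.:
//   int zero = 0;
//   cudaMemcpyToSymbol(workCounter, &zero, sizeof(int));
//   persistentKernel<<<numResidentBlocks, 256>>>(d_in, d_out, n);
```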

From my own experiments, compute capability 1.3 devices (I had a GTX 285) scheduled blocks in sequence. So, if I launched 60 blocks on 30 SMs, blocks 1-30 were scheduled on SMs 1-30, and then blocks 31-60 again on SMs 1 through 30. So, by disabling blocks 5 and 35, SM number 5 had almost nothing to do.
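Instead of relying on that launch-order observation, a more direct (but equally unofficial) experiment is to ask the hardware which SM a block landed on via the `%smid` PTX special register, and have blocks on "disabled" SMs bail out immediately. The sketch below combines this with the `workCounter` pool from the previous sketch, so the surviving blocks still process all the data. `smId`, `persistentOnSubsetOfSMs`, and `maxSm` are my placeholder names, and the `%smid` numbering is an undocumented implementation detail, so treat the results as experimental.

```cuda
// Read the ID of the SM this thread is currently running on.
__device__ unsigned int smId(void)
{
    unsigned int id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Blocks that land on an SM whose ID is >= maxSm exit at once. Because
// work is handed out through workCounter (see the previous sketch), the
// blocks on the remaining SMs still process every chunk, just on fewer SMs.
__global__ void persistentOnSubsetOfSMs(unsigned int maxSm,
                                        const float *in, float *out, int n)
{
    if (smId() >= maxSm)
        return;   // this block's SM is "disabled": contribute no work

    __shared__ int chunk;
    for (;;) {
        if (threadIdx.x == 0)
            chunk = atomicAdd(&workCounter, 1);
        __syncthreads();

        int base = chunk * blockDim.x;
        if (base >= n)
            return;

        int i = base + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];
        __syncthreads();
    }
}
```

One caveat: when a disabled block exits, the scheduler is free to place another block on that SM, so a "disabled" SM still churns through empty blocks; for a scaling measurement that is usually acceptable, since it contributes no useful work.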

Note, however, that this is my personal, experimental observation, made 2 years ago. It is in no way confirmed or supported by NVIDIA, and may change (or may have already changed) with newer GPUs and/or drivers.

My advice: try playing with some simple kernels that do a lot of dumb work and see how long they take to compute with various "enabled"/"disabled" block configurations. If you are lucky, you will see a performance drop, indicating that two blocks are in fact being executed by the same SM.
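For that experiment, a host-side harness along these lines can time the same amount of dummy work while sweeping the number of "enabled" SMs. Again, this is my own sketch around the hypothetical kernel above, using only standard CUDA runtime calls (`cudaGetDeviceProperties`, `cudaMemcpyToSymbol`, CUDA events).

```cuda
#include <cstdio>

int main(void)
{
    const int n = 1 << 24;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int smCount = prop.multiProcessorCount;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (unsigned int maxSm = 1; maxSm <= (unsigned int)smCount; ++maxSm) {
        int zero = 0;
        cudaMemcpyToSymbol(workCounter, &zero, sizeof(int));  // reset the work pool

        cudaEventRecord(start);
        // Oversubscribe a little; blocks on disabled SMs exit immediately.
        persistentOnSubsetOfSMs<<<smCount * 8, 256>>>(maxSm, d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("SMs enabled: %2u  time: %8.3f ms\n", maxSm, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If the runtime drops roughly in proportion to the number of enabled SMs, the trick is working; a plateau suggests the hardware is co-scheduling blocks differently than assumed.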


Source: https://habr.com/ru/post/1383181/

