It's common to put loops in kernels. That doesn't mean it's always a good idea, but it doesn't mean it's a bad one either.
The general problem of figuring out how to efficiently distribute your tasks and data and exploit the available parallelism is a complex, unsolved one, especially so with CUDA. There is active research into determining (that is, without blindly searching the parameter space) how to get the best results out of these kernels.
Sometimes it can make a lot of sense to put loops in kernels. For example, iterative computations over many elements of a large, regular data structure that exhibit strong data independence are ideal for kernels containing loops. In other cases, you might decide that each thread should process many data points if, for example, there would not be enough shared memory to assign one thread per element (this is not uncommon when a large number of threads share a large amount of data; by increasing the amount of work done per thread, you can fit all of the threads' shared data into shared memory).
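As a concrete illustration (not from the original post), here's a minimal sketch of the "loop in a kernel" idea using a grid-stride loop: each thread walks the array in steps of the total grid size, so one modest launch covers an arbitrarily large, data-independent workload. The names (`scale_kernel`, `data`, `n`) and the launch configuration are placeholders.

```cuda
#include <cuda_runtime.h>

// Each thread processes many elements: idx, idx + stride, idx + 2*stride, ...
// so the grid does not need one thread per element.
__global__ void scale_kernel(float *data, int n, float factor)
{
    int stride = blockDim.x * gridDim.x;   // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= factor;                 // independent per-element work
}

int main()
{
    const int n = 1 << 20;                 // 1M elements
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Far fewer threads than elements: the loop inside the kernel covers the rest.
    scale_kernel<<<128, 256>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

The same pattern works whether the data fits one-thread-per-element or not, which is part of why it's such a common default.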
It's best to make an educated guess, test, profile, and revise as needed. There are plenty of knobs to play with when optimizing: launch parameters, global and constant versus shared memory, keeping register counts low, ensuring coalesced memory access and avoiding shared memory bank conflicts, etc. If you're interested in performance, you should check out the CUDA C Best Practices Guide and the CUDA Occupancy Calculator, available from NVIDIA on the CUDA 4.0 documentation page (if you haven't already).
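To make the "test and profile" part concrete, here's a hedged sketch of timing one kernel under several candidate block sizes with CUDA events; it reuses the illustrative `scale_kernel` from above, and the candidate sizes are arbitrary, not recommendations. A real tuning pass would also vary grid size, shared memory usage, and so on.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_kernel(float *data, int n, float factor)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 22;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Candidate launch parameters to compare empirically.
    const int candidates[] = {64, 128, 256, 512, 1024};
    for (int k = 0; k < 5; ++k) {
        int block = candidates[k];
        int grid  = (n + block - 1) / block;   // enough blocks to cover n
        cudaEventRecord(start);
        scale_kernel<<<grid, block>>>(d_data, n, 1.001f);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block=%4d  grid=%7d  %.3f ms\n", block, grid, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Measurements like this complement, rather than replace, the Occupancy Calculator: the calculator tells you what's theoretically resident, the timings tell you what actually runs fastest on your hardware.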