Putting a for loop inside a CUDA kernel

Is it a good idea to put a for loop inside a CUDA kernel, or is this a common thing to do?

+6
4 answers

It is common to put loops inside kernels. That does not mean it is always a good idea, but it does not mean it is always a bad one either.

The general problem of deciding how to distribute your tasks and data efficiently and exploit the resulting parallelism is hard and largely unsolved, especially with CUDA. There is active research into how to determine efficiently (that is, without blindly searching the parameter space) how to get the best performance out of a given kernel.

Sometimes it makes a lot of sense to put loops inside kernels. For example, iterative computations over the many elements of a large, regular data structure with strong data independence are a natural fit for kernels containing loops. In other cases, you may decide to have each thread process many data points because, for example, you would not otherwise have enough shared memory to go around (this is not uncommon when a large number of threads share a large amount of data; by increasing the amount of work done per thread, you can fit all of a block's shared data into shared memory).
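A common way to have each thread process many elements is a grid-stride loop. Below is a minimal sketch of that pattern; the kernel name and the saxpy-style operation are purely illustrative, not from the answer above.

```
// Grid-stride loop: each thread walks over the array in steps of the total
// grid size, so the launch configuration does not have to match the problem size.
__global__ void scale_add(const float *x, float *y, float a, int n)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int stride = gridDim.x * blockDim.x;                 // total threads in the grid

    // Neighbouring threads touch neighbouring elements on every iteration,
    // so the global memory accesses stay coalesced.
    for (int i = idx; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Possible launch: a fixed, modest grid; the loop absorbs any problem size.
// scale_add<<<256, 256>>>(d_x, d_y, 2.0f, n);
```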

It is best to make an educated guess, test, profile, and revise as needed. There is plenty of room to play with optimization... launch parameters, global and constant versus shared memory, keeping register counts low, ensuring coalesced accesses and avoiding shared-memory bank conflicts, and so on. If you care about performance, you should check out the CUDA C Best Practices Guide and the CUDA Occupancy Calculator, available from NVIDIA on the CUDA 4.0 documentation page (if you haven't already).

+6

Generally it is fine, as long as you are careful about the memory access pattern. If the for loop accesses memory in a random fashion, it will produce many uncoalesced reads, which can be very slow.

In fact, I once had a piece of code run slower on CUDA because I naively put a for loop in the kernel. However, once I thought about the memory accesses, for example loading one chunk at a time into shared memory so that each thread block could work through its part of the for loop together, it became much faster.
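As a rough sketch of what that restructuring can look like, here is a dense matrix-vector product used as a stand-in example: the vector is staged in shared memory one tile at a time so every thread in the block reuses each tile instead of re-reading it from global memory. The kernel name, the TILE size, and the assumption that the block size equals TILE are mine, not from the answer.

```
#define TILE 256

// y = A * x, row-major A of size rows x cols.
// Launch with blockDim.x == TILE, e.g.
//   matvec_tiled<<<(rows + TILE - 1) / TILE, TILE>>>(d_A, d_x, d_y, rows, cols);
__global__ void matvec_tiled(const float *A, const float *x, float *y,
                             int rows, int cols)
{
    __shared__ float x_tile[TILE];

    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one row per thread
    float acc = 0.0f;

    for (int base = 0; base < cols; base += TILE) {
        // Cooperative, coalesced load of one tile of x into shared memory.
        int j = base + threadIdx.x;
        x_tile[threadIdx.x] = (j < cols) ? x[j] : 0.0f;
        __syncthreads();

        // Each thread consumes the whole tile for its own row.
        if (row < rows) {
            int limit = min(TILE, cols - base);
            for (int k = 0; k < limit; ++k)
                acc += A[row * cols + base + k] * x_tile[k];
        }
        __syncthreads();  // wait before overwriting x_tile in the next iteration
    }

    if (row < rows)
        y[row] = acc;
}
```

Note that the reads of A above are strided across threads; a real implementation would also rearrange that access, but the point here is the shared tile of x.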

+4
  • The basic pattern for processing large data sets is a tiled decomposition: the input data is partitioned, each thread works on its own chunk, and that is where the loop comes in.
    Example 1: if the input is a 2D matrix whose number of rows is known to exceed its number of columns, I would index the rows with the block's unique grid index and walk the columns with a per-thread loop over the tile size.
    Example 2: if your threads need to produce a single value that is required for further computation (for example, a standard vector normalization), you need a tiled approach, since threads can only synchronize efficiently within a block. A sketch of both patterns follows this list.
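A minimal sketch combining both examples, assuming one block per row, a thread loop over the columns, and an intra-block shared-memory reduction for the single value every thread then needs. The kernel name and block size are illustrative, and the block dimension is assumed to be a power of two equal to BLOCK.

```
#define BLOCK 256

// Normalize each row of a row-major rows x cols matrix to unit L2 norm.
// Launch as: normalize_rows<<<rows, BLOCK>>>(d_A, rows, cols);
__global__ void normalize_rows(float *A, int rows, int cols)
{
    __shared__ float partial[BLOCK];

    int row = blockIdx.x;            // one block per row (Example 1)
    if (row >= rows) return;
    float *r = A + (size_t)row * cols;

    // Phase 1: each thread accumulates part of the sum of squares
    // by looping over the columns with a stride of the block size.
    float acc = 0.0f;
    for (int j = threadIdx.x; j < cols; j += blockDim.x)
        acc += r[j] * r[j];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Phase 2: tree reduction in shared memory; only threads within a block
    // can synchronize like this (Example 2).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    float norm = sqrtf(partial[0]);

    // Phase 3: every thread uses the shared result to scale its columns.
    if (norm > 0.0f)
        for (int j = threadIdx.x; j < cols; j += blockDim.x)
            r[j] /= norm;
}
```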
+1

As long as the loop is not at the top level, you should probably be fine. Putting it at the top level, so that the work that should be split across threads is done serially, negates all the benefits of CUDA.

As Dan points out, memory access patterns become the issue. One way around this is to load the data the loop touches into shared memory, or into texture memory if it does not fit in shared memory. The reason is that global memory access is very slow (~400 clock cycles versus ~40 for shared memory).
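As a small illustration of that advice, the sketch below stages a frequently reused coefficient table in shared memory so the inner loop never goes back to global memory for it. The FIR-filter setting, names, and sizes are made up for the example; on current GPUs, marking read-only pointers const __restrict__ (or using __ldg) routes loads through the read-only data cache, which plays a role similar to the texture path mentioned here.

```
#define NTAPS 32

// Simple causal FIR filter: out[i] = sum_k taps[k] * in[i - k].
// Assumes blockDim.x >= NTAPS so the cooperative copy covers all taps.
__global__ void fir(const float * __restrict__ in,
                    const float * __restrict__ taps,
                    float *out, int n)
{
    __shared__ float s_taps[NTAPS];

    // Copy the small, heavily reused table into shared memory once per block.
    if (threadIdx.x < NTAPS)
        s_taps[threadIdx.x] = taps[threadIdx.x];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // The inner loop hits shared memory (tens of cycles) for every tap
    // instead of global memory (hundreds of cycles).
    float acc = 0.0f;
    for (int k = 0; k < NTAPS && i - k >= 0; ++k)
        acc += s_taps[k] * in[i - k];
    out[i] = acc;
}
```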

0

Source: https://habr.com/ru/post/894812/

