On compute capability 1.0 devices, there were only two possibilities:
- Memory accesses are coalesced, and all the data is fetched in a single memory transaction.
- Memory accesses are not coalesced, and each thread's data is fetched separately, so there are always 16 memory transactions (one per thread of the half-warp).
On compute capability 1.2 and 1.3 devices, however, this works differently. Imagine that device memory is divided into 128-byte segments. You need as many memory transactions as the number of segments your half-warp touches. So:
- If the access is fully coalesced, you get 1 memory transaction.
- If the access is merely misaligned, you get 2 memory transactions.
- If each thread accesses every n-th word (strided access), you can get 3, 4, or even more memory transactions.
- In the worst case, you get 16 memory transactions.
- But even if the access is somewhat random yet localized, two threads may land in the same segment, and you will need fewer than 16 memory transactions.
There are so many cases that splitting them into just two categories, coalesced and uncoalesced, no longer makes sense. That's why the CUDA profiler took a different approach: it simply counts memory transactions. The more scattered your access pattern, the higher the transaction count, even when the amount of data accessed is the same.
The above model is somewhat simplified. In reality, a memory transaction can cover a 128-byte, 64-byte, or 32-byte segment, to save bandwidth. Look at the load 128b, load 64b, load 32b and store 128b, store 64b, store 32b counters in your profiler.