On compute capability 1.0 devices, there were only two possibilities:
- Memory accesses are coalesced, and all the data is fetched in a single memory transaction.
- Memory accesses are not coalesced, and each thread's data is fetched separately, so there are always 16 memory transactions (one per thread of the half-warp).
On compute capability 1.2 and 1.3 devices, however, this works differently. Imagine that device memory is divided into 128-byte segments. You need as many memory transactions as the number of segments your half-warp touches. So:
- If the access is fully coalesced, you get 1 memory transaction.
- If the access is merely misaligned, you get 2 memory transactions.
- If each thread accesses every n-th word (strided access), you can get 3, 4, or even more memory transactions.
- In the worst case, you get 16 memory transactions.
- But even if the access is somewhat random yet localized, two threads may land in the same segment, and you will need fewer than 16 memory transactions.
There are so many cases that splitting them into just two categories, coalesced and uncoalesced, no longer makes sense. That's why the CUDA profiler took a different approach: it simply counts memory transactions. The more scattered your access pattern, the higher the transaction count, even when the amount of data accessed is the same.
The above model is somewhat simplified. In reality, a memory transaction can cover a 128-byte, 64-byte, or 32-byte segment, to save bandwidth. Look at the load 128b, load 64b, load 32b and store 128b, store 64b, store 32b counters in your profiler.