CUDA Global Memory Transaction Cost

According to the CUDA 5.0 Programming Guide, if I use both L1 and L2 caching (on Fermi or Kepler), all global memory operations are performed using 128-byte memory transactions. However, if I use only L2, 32-byte memory transactions are used (section F.4.2).

Assume all caches are empty. If each thread in a warp accesses one 4-byte word, and the accesses are contiguous and perfectly aligned, this will lead to one 128-byte transaction in the L1 + L2 case, and to four 32-byte transactions in the L2-only case. Is that correct?
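For concreteness, here is a minimal sketch of the access pattern I have in mind (an illustrative kernel of my own; assume the pointers come from cudaMalloc and are therefore suitably aligned):

    // Each thread reads one 4-byte word; consecutive threads read
    // consecutive words, so every warp touches exactly one aligned
    // 128-byte segment.
    __global__ void copy4B(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];  // fully coalesced load and store
    }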

My question is: are four 32-byte transactions slower than a single 128-byte transaction? My intuition from pre-Fermi hardware suggests they would be, but perhaps that is no longer the case on newer hardware? Or should I simply look at bandwidth utilization to evaluate the efficiency of my memory accesses?

1 answer

Yes, in cached mode a single 128-byte transaction will be generated (as seen from the L1 cache level). In uncached mode, four 32-byte transactions will be generated (as seen from the L2 cache level; it is still a single 128-byte request coming from the warp, due to coalescing). In the case you describe, the four 32-byte transactions are not slower: for fully coalesced access, performance is the same regardless of cached or uncached mode. The memory controller (on a given GPU) has to generate the same transactions to satisfy the warp's request either way. Since the memory controller consists of a number of partitions (up to 6), each with a 64-bit path width, several memory transactions (i.e., across several partitions) may ultimately be used to satisfy either request (4x32-byte or 1x128-byte). The specific number of transactions and their organization across partitions can vary from GPU to GPU (and is not really central to your question, but a GPU with double-pumped (DDR) memory would return 16 bytes per partition per memory transaction, while one with quad-pumped (QDR) memory would return 32 bytes per partition per memory transaction). None of this is specific to CUDA 5.

You may want to review one of the NVIDIA webinars on this material, in particular "CUDA Optimization: Memory Bandwidth Limited Kernels". Even if you don't want to watch the video, a quick review of the slides will remind you of the various differences between so-called "cached" and "uncached" accesses (this refers to L1), and will also give you the compiler switches needed for each case.
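For reference, these are the switches in question; a minimal sketch, assuming an nvcc command line targeting a Fermi-class GPU (the file name is mine):

    # cached mode (the default on Fermi): global loads go through
    # L1 and L2 and are serviced with 128-byte transactions
    nvcc -arch=sm_20 -Xptxas -dlcm=ca kernel.cu

    # uncached mode: global loads bypass L1 and go to L2 only,
    # serviced with 32-byte transactions
    nvcc -arch=sm_20 -Xptxas -dlcm=cg kernel.cu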

Another reason to review the slides is that they remind you of the circumstances under which it is worth trying uncached mode. In particular, if you have a scattered (uncoalesced) access pattern coming from your warps, uncached accesses can yield an improvement, because when 32 bytes are requested from memory, less of the fetched data is "wasted" in satisfying a single thread's request than with a 128-byte granularity. However, in answer to your final question, this is rather difficult to analyze, because, presumably, your code is a mix of coalesced and uncoalesced access patterns. Since uncached mode is enabled with a compiler switch, the advice given in the slides is simply to try your code both ways and see which runs faster. In my experience, running in uncached mode rarely yields an improvement in performance.
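To illustrate (a sketch of my own, not taken from the slides), a gather through an arbitrary index array is the kind of scattered pattern meant here. In cached mode, each miss drags in 128 bytes of which a thread may use only 4; with 32-byte transactions, only 32 bytes are fetched per request:

    // Hypothetical uncoalesced pattern: idx[] is arbitrary, so each
    // thread's 4-byte load may fall in a different 128-byte segment.
    // With 32-byte (uncached) transactions, 4 of 32 fetched bytes are
    // used per thread instead of 4 of 128.
    __global__ void gather(const float *in, const int *idx, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[idx[i]];
    }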

EDIT: Sorry, I had the link and title of the wrong presentation. I have fixed the slide/video link and the webinar title.



