Yes, in cached mode, a single transaction of 128 bytes will be generated (as observed at the L1 cache level). In uncached mode, four 32-byte transactions will be generated (as observed at the L2 cache level; it is still a single 128-byte request emanating from the warp, due to coalescing). In the case you describe, the four 32-byte transactions are not slower, for fully coalesced access, whether in cached or uncached mode. The memory controller (on this GPU) must generate the same transactions to satisfy the warp request anyway. Since the memory controller consists of a number of "partitions" (up to 6), each of which has a 64-bit-wide path, in the end several memory transactions (e.g. across several partitions) may be used to satisfy any request (4x32 bytes or 1x128 bytes). The specific number of transactions and their organization across partitions can vary from GPU to GPU (and is not really part of your question, but a GPU whose memory uses DDR signaling will return 16 bytes per partition per memory transaction, and with QDR (quad-pumped) signaling it will return 32 bytes per partition per memory transaction). None of this is specific to CUDA 5, either. You may want to review one of the NVIDIA webinars on this material, in particular "CUDA Optimization: Memory Bandwidth Limited Kernels". Even if you don't want to watch the video for a quick overview, the slides will remind you of the various differences between so-called "cached" and "uncached" loads (this pertains to L1), and will also give you the compiler switches needed for each case.
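To make the "fully coalesced" case concrete, here is a minimal sketch (not from the original answer; the kernel and names are illustrative): each 32-thread warp reads 32 consecutive, aligned floats, i.e. one 128-byte request, so cached mode issues one 128-byte transaction at L1 while uncached mode issues four 32-byte transactions at L2, with the same traffic at the memory controller either way.

```cuda
// Illustrative only: a fully coalesced copy kernel.
// Thread k of each warp touches element base+k, so the warp's loads
// collapse into a single 128-byte, 128-byte-aligned request.
__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // one 128B request per warp, cached or not
}
```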
Another reason to review the slides is that they will remind you under what circumstances you might want to try uncached mode. In particular, if you have a scattered (uncoalesced) access pattern emanating from your warps, uncached access can lead to an improvement, because when 32 bytes are requested from memory, less is "wasted" to satisfy the request of a single thread than with a 128-byte quantity. However, in answer to your final question, it is rather difficult to analyze this, because by your description your code is a mix of ordered and disordered access patterns. Since uncached mode is enabled via a compiler switch, the suggestion given in the slides is simply to "try your code both ways" and see which runs faster. In my experience, running in uncached mode rarely yields a performance improvement.
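"Trying it both ways" can be sketched as two builds, one per load-caching mode. This is a hedged sketch: the `-dlcm` values are the documented ptxas cache-modifier options for Fermi-class GPUs, while the file and binary names (`kernel.cu`, `app_ca`, `app_cg`) are placeholders.

```shell
# Build flags for the two global-load caching modes (passed through to ptxas).
NVCC_CACHED="-Xptxas -dlcm=ca"    # cached loads through L1 (128-byte lines), the default
NVCC_UNCACHED="-Xptxas -dlcm=cg"  # uncached loads, L2 only (32-byte segments)

# Illustrative build commands; time each resulting binary on your workload.
echo "cached build:   nvcc $NVCC_CACHED -o app_ca kernel.cu"
echo "uncached build: nvcc $NVCC_UNCACHED -o app_cg kernel.cu"
```

Whichever binary runs faster on your real data is the mode to keep; for a mixed access pattern there is no reliable way to predict this on paper.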
EDIT: Sorry, I had the link and title for the wrong presentation. Fixed the slide/video link and the webinar title.