CUDA: Is it possible to use all 48KB of On-Die memory as shared memory?

I am developing a CUDA application for the GTX 580 with CUDA Toolkit 4.0 and Visual Studio 2010 Professional on Windows 7 64-bit SP1. My program is more memory intensive than typical CUDA programs, and I am trying to allocate as much shared memory as possible for each CUDA block. However, the program crashes every time I try to use more than 32 KB of shared memory per block.

From reading the official CUDA docs, I learned that every SM on a CUDA device with compute capability 2.0 or higher has 48 KB of on-die memory, and that this on-die memory is split between the L1 cache and shared memory:

"The same on-chip memory is used for both L1 and shared memory, and how much of it is dedicated to L1 versus shared memory is configurable for each kernel call" (Section F.4.1) http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Fermi_Tuning_Guide.pdf

This made me suspect that only 32 KB of the on-die memory was being allocated as shared memory when my program was running. Hence my question: is it possible to use all 48 KB of on-die memory as shared memory?

I tried everything I could think of. I specified --ptxas-options="-v -dlcm=cg" for nvcc, and I called cudaDeviceSetCacheConfig() and cudaFuncSetCacheConfig() in my program (a sketch of those calls follows the compiler output below), but none of it solved the problem. I also made sure that there was no register spilling and that I was not accidentally using local memory:

1> 24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Used 63 registers, 40000+0 bytes smem, 52 bytes cmem[0], 2540 bytes cmem[2], 8 bytes cmem[14], 72 bytes cmem[16]
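For reference, here is a stripped-down sketch of how I make those calls; the kernel shown is only a placeholder with the same 40000-byte shared array, not my real code:

#include <cuda_runtime.h>

__global__ void myKernel(float *data)        // placeholder kernel
{
    __shared__ float tile[10000];            // 40000 bytes of shared memory
    tile[threadIdx.x] = data[threadIdx.x];
    data[threadIdx.x] = tile[threadIdx.x] * 2.0f;
}

int main()
{
    // Request the 48 KB shared / 16 KB L1 split, both device-wide and per kernel.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    float *d_data;
    cudaMalloc(&d_data, 640 * sizeof(float));
    myKernel<<<1, 640>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}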

Although I can live with 32 KB of shared memory, which has already given me a huge performance boost, I would prefer to make full use of all the fast on-die memory. Any help is greatly appreciated.

Update: I was launching 640 threads per block when the program crashed. 512 threads gave me better performance than 256, so I tried increasing the thread count further.

+4
3 answers

Your problem is not the shared memory configuration but the number of threads you are launching.

Using 63 registers per thread and launching 640 threads requires 63 × 640 = 40320 registers in total. Your device has 32K (32768) registers per SM, so you are running out of resources.
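A minimal sketch (the kernel is a placeholder, not the asker's code) of how __launch_bounds__ caps register usage so a 640-thread block can launch: with 32768 registers per SM on compute capability 2.0, 640 threads leave at most 32768 / 640 = 51 registers per thread, and the compiler will spill anything beyond that to local memory rather than fail the launch with "too many resources requested for launch".

// Tell the compiler this kernel will be launched with at most 640 threads
// per block, so it limits register usage accordingly.
__global__ void __launch_bounds__(640) heavyKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;   // the real kernel would do register-heavy work here
}

The trade-off is that forcing a lower register count may introduce spills to local memory, so reducing the block size back to 512 threads is the other option worth benchmarking.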

Regarding the on-chip memory, this is well explained in Tom's answer, and as he commented, checking your API calls for errors will help you catch problems like this in the future.

+6

Devices with compute capability 2.0 and higher have 64 KB of on-chip memory per SM. This can be configured as 16 KB L1 and 48 KB shared memory, or as 48 KB L1 and 16 KB shared memory (compute capability 3.x also allows a 32 KB / 32 KB split).

Your program is crashing for another reason. Are you checking all API calls for errors? Have you tried cuda-memcheck?

If you requested too much shared memory, you would get an error at kernel launch saying that there are not enough resources.
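A minimal sketch of the kind of error checking meant here (the macro name is my own, not from the answer); a launch failure such as "too many resources requested for launch" shows up through cudaGetLastError():

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call and report the error string on failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// After a kernel launch, check both the launch itself and its completion:
//     myKernel<<<grid, block>>>(d_data);
//     CUDA_CHECK(cudaGetLastError());       // launch configuration errors
//     CUDA_CHECK(cudaDeviceSynchronize());  // asynchronous execution errors

Running the application under cuda-memcheck (cuda-memcheck ./myapp) will additionally report out-of-bounds shared or global memory accesses.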

+3

In addition, passing kernel parameters from the host to the GPU uses shared memory (up to 256 bytes), so you will never get the full 48 KB in practice.

-1

Source: https://habr.com/ru/post/1433954/
