I am developing a CUDA application for the GTX 580 with CUDA Toolkit 4.0 and Visual Studio 2010 Professional on Windows 7 64-bit SP1. My program is more memory-intensive than typical CUDA programs, and I try to allocate as much shared memory as possible for each CUDA block. However, the program crashes every time I try to use more than 32 KB of shared memory per block.
From reading the official CUDA docs, I learned that every SM on a CUDA device with Compute Capability 2.0 or higher has 48 KB of on-chip memory, and this on-chip memory is split between the L1 cache and shared memory:
"The same on-chip memory is used for both L1 and shared memory, and how much of it is dedicated to L1 versus shared memory is configurable for each kernel call (Section F.4.1)" http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Fermi_Tuning_Guide.pdf
This made me suspect that only 32 KB of the on-chip memory was being allocated as shared memory when my program was running. Therefore, my question is: is it possible to use all 48 KB of on-chip memory as shared memory?
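A quick standalone check of what the runtime reports for the per-block limit would look something like this (assuming the GTX 580 is device 0; this is just a sketch, not part of my program):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // assuming the GTX 580 is device 0
        // sharedMemPerBlock is the maximum shared memory per block in bytes
        printf("Shared memory per block: %u bytes\n", (unsigned int)prop.sharedMemPerBlock);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        return 0;
    }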
I tried everything I could think of. I passed --ptxas-options="-v -dlcm=cg" to nvcc, and I called cudaDeviceSetCacheConfig() and cudaFuncSetCacheConfig() in my program, but none of that solved the problem. I also verified that there is no register spilling and that I am not accidentally using local memory:
    1> 24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    1> ptxas info : Used 63 registers, 40000+0 bytes smem, 52 bytes cmem[0], 2540 bytes cmem[2], 8 bytes cmem[14], 72 bytes cmem[16]
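The cache-configuration calls mentioned above look roughly like this in my host code (excerpt only; myKernel stands in for the real kernel name):

    // Request the 48 KB shared / 16 KB L1 split, both device-wide and per kernel.
    cudaError_t err = cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    if (err != cudaSuccess)
        printf("cudaDeviceSetCacheConfig failed: %s\n", cudaGetErrorString(err));

    err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    if (err != cudaSuccess)
        printf("cudaFuncSetCacheConfig failed: %s\n", cudaGetErrorString(err));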
Although I could live with 32 KB of shared memory, which has already given me a huge performance boost, I would prefer to make full use of all the fast on-chip memory. Any help is greatly appreciated.
Update: the program crashed when I launched 640 threads per block. 512 threads gave me better performance than 256, so I tried to increase the thread count further.
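For completeness, a stripped-down version of the failing launch looks roughly like this (my real kernel is much more complex and uses 63 registers as shown above; the kernel body, names, and block count here are placeholders that only mirror the 40000 bytes of shared memory and the 640-thread configuration):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel with 40000 bytes of statically declared shared memory,
    // matching the "40000+0 bytes smem" line in the ptxas output above.
    __global__ void myKernel(float *out)
    {
        __shared__ float buffer[10000];            // 10000 * 4 B = 40000 B of shared memory
        buffer[threadIdx.x] = (float)threadIdx.x;  // touch shared memory so it is not optimized away
        __syncthreads();
        out[blockIdx.x * blockDim.x + threadIdx.x] = buffer[threadIdx.x];
    }

    int main()
    {
        const int numBlocks = 16;          // placeholder block count
        const int threadsPerBlock = 640;   // 640 threads is where my program crashes; 512 and 256 work
        float *d_out;
        cudaMalloc(&d_out, numBlocks * threadsPerBlock * sizeof(float));

        myKernel<<<numBlocks, threadsPerBlock>>>(d_out);
        cudaError_t err = cudaGetLastError();        // check the launch itself
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();           // then check kernel execution
        printf("Result: %s\n", cudaGetErrorString(err));

        cudaFree(d_out);
        return 0;
    }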