For some time I have been struggling with a problem for which I cannot find a solution. When I debug my CUDA code with NVIDIA Nsight in Visual Studio 2008, I get strange results when using shared memory.
My code is:
template<typename T> __device__ T integrate() { extern __shared__ T s_test[];
When I get to breakpoint 1 and look at the shared memory in the Visual Studio Watch window, only the first 8 values of the array change, and the rest remain empty. I would expect all of the first 64 to change.
I thought this might be because not all warps execute at the same time, so I tried to synchronize them. I added this code inside integrate():
template<typename T> __device__ T integrate() { __syncthreads(); if(index < 64 && blockIdx.x==0) { T tmp = s_test[index];
But the problem persists. In addition, the remaining values read into tmp are not 0, even though the Visual Studio Watch window shows them as 0.
I should mention that it takes a lot of steps to step over __syncthreads(), so when I reach it I just jump to breakpoint 2. What the hell is going on!?
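One way to cross-check what the Watch window shows is to copy the shared array back to the host and compare it there. The following is only a minimal sketch, not the code from the question: the kernel name `dump_shared`, the `float` element type, and the values written are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the asker's integrate()): the first 64 threads of
// the block each write one shared-memory slot, then copy it to global memory.
__global__ void dump_shared(float *out)
{
    extern __shared__ float s_test[];   // size is supplied at launch time

    // Flatten the 2D thread index, as needed for a 16x16 block.
    const unsigned int index = threadIdx.y * blockDim.x + threadIdx.x;

    if (index < 64)
        s_test[index] = 2.0f * index;

    // The barrier sits outside the branch so every thread in the block reaches it.
    __syncthreads();

    if (index < 64)
        out[index] = s_test[index];
}

int main()
{
    const int n = 64;
    float h_out[n];
    float *d_out = NULL;

    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemset(d_out, 0, n * sizeof(float));

    // One 16x16 block; dynamic shared memory sized for the 64 floats used.
    const dim3 block(16, 16, 1);
    dump_shared<<<1, block, n * sizeof(float)>>>(d_out);

    // cudaMemcpy waits for the kernel and reports any earlier launch error.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    // Count slots that do not hold the value the kernel should have written.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * i)
            ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_out);
    return 0;
}
```

If the host-side copy holds all 64 values while the Watch window still shows only 8, that would suggest the discrepancy is in how the debugger displays shared memory rather than in the kernel itself.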
EDIT: System / launch configuration information
System
- Name: Intel(R) Core(TM)2 Duo CPU E7300 @ 2.66 GHz
- Architecture: x86
- Frequency: 2,666 MHz
- Number of cores: 2
- Page size: 4,096
- Total physical memory: 3,582.00 MB
- Available physical memory: 1,983.00 MB
- Version name: Windows 7 Ultimate
- Version number: 6.1.7600
GeForce 9500 GT
- Driver version: 301.42
- Driver model: WDDM
- CUDA device index: 0
- GPU family: G96
- Compute capability: 1.1
- Number of SMs: 4
- Frame buffer physical size (MB): 512
- Frame buffer bandwidth (GB/s): 16
- Frame buffer bus width (bits): 128
- Frame buffer location: Dedicated
- Graphics clock (MHz): 812
- Memory clock (MHz): 500
- Processor clock (MHz): 1625
- RAM type: DDR2
IDE
- Microsoft Visual Studio Team System 2008
- NVIDIA Nsight Visual Studio Edition, Version 2.2 Build No. 2.2.0.12255
Compiler
1> "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin\nvcc.exe" -G -gencode=arch=compute_10,code=\"sm_10,compute_10\" --machine 32 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin" -D_NEXUS_DEBUG -g -D_DEBUG -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd" -I"inc" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\include" -maxrregcount=0 --compile -o "Debug/process_f2f.cu.obj" process_f2f.cu
Launch configuration. The shared memory size does not seem to matter; I tried several values. The one I worked with the most:
- Shared memory: 2048 bytes
- Grid / Block sizes: {101, 101, 1}, {16, 16, 1}
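For reference, the launch configuration above could be written out roughly as follows. This is a sketch under assumptions: the kernel `touch_shared` is a hypothetical stand-in for the real one, `float` is an assumed element type, and declaring the dynamic shared array as raw bytes and casting it is a common pattern for templated kernels, not necessarily what the original code does.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the question.
template<typename T>
__global__ void touch_shared(T *out)
{
    // Declared as raw bytes and cast to T, which avoids conflicting
    // declarations when the template is instantiated for several types.
    extern __shared__ __align__(16) unsigned char s_raw[];
    T *s_test = reinterpret_cast<T *>(s_raw);

    const unsigned int index = threadIdx.y * blockDim.x + threadIdx.x;
    s_test[index] = static_cast<T>(index);
    __syncthreads();

    // Only block (0, 0) writes its shared memory out to global memory.
    if (blockIdx.x == 0 && blockIdx.y == 0)
        out[index] = s_test[index];
}

int main()
{
    // Values taken from the configuration listed above.
    const dim3 grid(101, 101, 1);
    const dim3 block(16, 16, 1);
    const size_t shared_bytes = 2048;

    // 16 * 16 = 256 threads, one T each: 256 floats need 1024 bytes.
    const size_t threads = block.x * block.y * block.z;
    if (threads * sizeof(float) > shared_bytes) {
        printf("shared memory request too small\n");
        return 1;
    }

    float *d_out = NULL;
    cudaMalloc((void **)&d_out, threads * sizeof(float));

    // Third launch parameter: dynamic shared memory per block, in bytes.
    touch_shared<float><<<grid, block, shared_bytes>>>(d_out);

    // Check for launch errors first, then for errors during execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Checking the error code after the launch rules out a failed launch (for example, an oversized shared memory request) before any time is spent in the debugger.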