CUDA: Unable to see shared memory values in Nsight debugging

For some time I have been struggling with a problem I cannot find a solution to. When I debug my CUDA code using NVIDIA Nsight in Visual Studio 2008, I get strange results when using shared memory.

My code is:

    template<typename T>
    __device__ T integrate()
    {
        extern __shared__ T s_test[]; // Dynamically allocated shared memory

        /**** Breakpoint (1) here ****/

        int index = threadIdx.x + threadIdx.y * blockDim.x; // Local index in block. Column major ordering

        if(index < 64 && blockIdx.x==0) { // Only work on a few values. Just testing
            s_test[index] = (T)index;
            /* Some other irrelevant code here */
        }

        return v;
    }

When I reach breakpoint 1 and look at the shared memory in the Visual Studio Watch window, only the first 8 values of the array change; the rest stay at zero. I would expect all of the first 64 to change.

[Screenshot: Watch window from Visual Studio]

I thought this might be because not all warps execute at the same time, so I tried to synchronize them. I added this code inside integrate():

    template<typename T>
    __device__ T integrate()
    {
        /* Old code is still here */

        __syncthreads();

        /**** Breakpoint (2) here ****/

        if(index < 64 && blockIdx.x==0) {
            T tmp = s_test[index]; // Write to tmp variable so I can inspect it inside Nsight Watch window
            v = tmp + index;       // Use `tmp` and `index` somehow so that the compiler doesn't optimize it out of existence
        }

        return v;
    }

But the problem persists. Moreover, the remaining values inside tmp are not 0, contrary to what the VS Watch window displays.

[Screenshot: Watch window from Nsight]

I should mention that it takes a lot of steps to step over __syncthreads(), so when I reach it I just jump to breakpoint 2. What the hell is going on!?
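To make the warp reasoning above concrete, here is a minimal sketch that records which warp each thread of a 16x16 block belongs to. It uses the same linear index as integrate(); the kernel name record_warp_ids and the helper warp_ids_for_16x16_block are hypothetical and not part of the original code. It only illustrates how indices map to warps and does not explain what the debugger displays.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores the warp it belongs to.
__global__ void record_warp_ids(int* warp_of_index)
{
    // Same linear index as in integrate().
    int index = threadIdx.x + threadIdx.y * blockDim.x;
    warp_of_index[index] = index / warpSize;
}

// Hypothetical helper: launches one 16x16 block and returns the warp id
// for every linear index, or an empty vector if a CUDA call fails.
std::vector<int> warp_ids_for_16x16_block()
{
    const dim3 block(16, 16, 1);
    const int threads = block.x * block.y;
    std::vector<int> host(threads, -1);
    int* dev = NULL;
    if (cudaMalloc(&dev, threads * sizeof(int)) != cudaSuccess)
        return std::vector<int>();
    record_warp_ids<<<1, block>>>(dev);
    cudaError_t err = cudaGetLastError();  // launch configuration errors
    if (err == cudaSuccess)                // cudaMemcpy waits for the kernel
        err = cudaMemcpy(&host[0], dev, threads * sizeof(int),
                         cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess ? host : std::vector<int>();
}

int main()
{
    std::vector<int> warps = warp_ids_for_16x16_block();
    if (warps.empty()) { std::printf("CUDA call failed\n"); return 1; }
    const int probes[] = {0, 31, 32, 63, 64, 255};
    for (int i = 0; i < 6; ++i)
        std::printf("index %3d -> warp %d\n", probes[i], warps[probes[i]]);
    return 0;
}
```

With a warp size of 32, indices 0 to 63 fall into warps 0 and 1, so the writes to s_test come from two different warps of the block.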


EDIT: Information about the system / launch configuration

System

  • Name: Intel(R) Core(TM)2 Duo CPU E7300 @ 2.66GHz
  • Architecture: x86
  • Frequency: 2,666 MHz
  • Number of cores: 2
  • Page size: 4,096
  • Total physical memory: 3,582.00 MB
  • Available physical memory: 1,983.00 MB
  • Version name: Windows 7 Ultimate
  • Version number: 6.1.7600

GeForce 9500 GT

  • Driver version: 301.42
  • Driver model: WDDM
  • CUDA device index: 0
  • GPU family: G96
  • Compute capability: 1.1
  • Number of SMs: 4
  • Frame buffer physical size (MB): 512
  • Frame buffer bandwidth (GB/s): 16
  • Frame buffer bus width (bits): 128
  • Frame buffer location: Dedicated
  • Graphics clock (MHz): 812
  • Memory clock (MHz): 500
  • Processor clock (MHz): 1625
  • RAM type: DDR2

IDE

  • Microsoft Visual Studio Team System 2008
  • NVIDIA Nsight Visual Studio Edition, Version 2.2 Build No. 2.2.0.12255

Compiler

1> "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin\nvcc.exe" -G -gencode=arch=compute_10,code=\"sm_10,compute_10\" --machine 32 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin" -D_NEXUS_DEBUG -g -D_DEBUG -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd" -I"inc" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\include" -maxrregcount=0 --compile -o "Debug/process_f2f.cu.obj" process_f2f.cu

Launch configuration

The size of the shared memory does not seem to matter; I tried several values. The one I worked with the most:

  • Shared memory: 2048 bytes
  • Grid / Block Sizes: {101, 101, 1}, {16, 16, 1}
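For reference, the dynamic shared memory size is passed as the third launch parameter. The following is a minimal sketch using the grid, block, and shared memory sizes listed above. The kernel sum_first_64 and the helper run_sum_first_64 are hypothetical, and float is assumed in place of the templated T; the real kernel in process_f2f.cu is not shown in the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical test kernel; float stands in for the templated T.
__global__ void sum_first_64(float* sums)
{
    extern __shared__ float s_test[];  // size comes from the third launch parameter
    int index = threadIdx.x + threadIdx.y * blockDim.x;
    if (index < 64 && blockIdx.x == 0) s_test[index] = (float)index;
    __syncthreads();                   // reached by every thread of the block
    if (index == 0 && blockIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < 64; ++i) sum += s_test[i];
        sums[blockIdx.y] = sum;        // one result per block with blockIdx.x == 0
    }
}

// Hypothetical helper: launches with the sizes from the question and returns
// one sum per blockIdx.y, or an empty vector if a CUDA call fails.
std::vector<float> run_sum_first_64()
{
    const dim3 grid(101, 101, 1);
    const dim3 block(16, 16, 1);
    const size_t shared_bytes = 2048;  // room for 2048 / sizeof(float) = 512 floats
    std::vector<float> host(grid.y, 0.0f);
    float* dev = NULL;
    if (cudaMalloc(&dev, host.size() * sizeof(float)) != cudaSuccess)
        return std::vector<float>();
    sum_first_64<<<grid, block, shared_bytes>>>(dev);
    cudaError_t err = cudaGetLastError();  // launch configuration errors
    if (err == cudaSuccess)                // cudaMemcpy waits for the kernel
        err = cudaMemcpy(&host[0], dev, host.size() * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess ? host : std::vector<float>();
}

int main()
{
    std::vector<float> sums = run_sum_first_64();
    if (sums.empty()) { std::printf("CUDA call failed\n"); return 1; }
    std::printf("sum for block (0, 0): %.0f\n", sums[0]);
    return 0;
}
```

If all 64 writes land in shared memory, each reported sum is 0 + 1 + ... + 63 = 2016, which gives a check that does not depend on the Watch window.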
1 answer

Have you tried putting __syncthreads() after assigning the values?

    template<typename T>
    __device__ T integrate()
    {
        extern __shared__ T s_test[]; // Dynamically allocated shared memory

        int index = threadIdx.x + threadIdx.y * blockDim.x; // Local index in block. Column major ordering

        if(index < 64 && blockIdx.x==0) { // Only work on a few values. Just testing
            s_test[index] = (T)index;
            /* Some other irrelevant code here */
        }

        __syncthreads();

        /**** Breakpoint (1) here ****/

        return v;
    }

And try to see the values at this breakpoint.
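One way to cross-check this suggestion without relying on the debugger is to copy the shared values out to global memory after the barrier and inspect them on the host. The sketch below is an assumption-laden illustration, not the original code: the kernel dump_reversed and the helper run_dump_reversed are hypothetical, and float replaces the templated T. Each thread reads an element written by a different thread, so the result is only correct if the writes are visible after __syncthreads().

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: write, synchronize, then read a value that another
// thread (for most indices in another warp) has written.
__global__ void dump_reversed(float* out)
{
    extern __shared__ float s_test[];  // dynamically allocated shared memory
    int index = threadIdx.x + threadIdx.y * blockDim.x;
    if (index < 64) s_test[index] = (float)index;
    __syncthreads();                   // placed after the assignment, as suggested
    if (index < 64) out[index] = s_test[63 - index];
}

// Hypothetical helper: runs one 16x16 block and returns the 64 copied values,
// or an empty vector if a CUDA call fails.
std::vector<float> run_dump_reversed()
{
    const int count = 64;
    std::vector<float> host(count, -1.0f);
    float* dev = NULL;
    if (cudaMalloc(&dev, count * sizeof(float)) != cudaSuccess)
        return std::vector<float>();
    dump_reversed<<<1, dim3(16, 16, 1), count * sizeof(float)>>>(dev);
    cudaError_t err = cudaGetLastError();  // launch configuration errors
    if (err == cudaSuccess)                // cudaMemcpy waits for the kernel
        err = cudaMemcpy(&host[0], dev, count * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess ? host : std::vector<float>();
}

int main()
{
    std::vector<float> values = run_dump_reversed();
    if (values.empty()) { std::printf("CUDA call failed\n"); return 1; }
    int mismatches = 0;
    for (int i = 0; i < 64; ++i)
        if (values[i] != (float)(63 - i)) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);
    return 0;
}
```

If the host sees no mismatches while the Watch window still shows only a few values, the discrepancy is more likely in how the debugger presents shared memory than in the kernel itself.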


Source: https://habr.com/ru/post/1437439/

