Permalink cuda memory

I have an array in read-only memory (this is a global variable) and got a link to it by calling the cudaGetSymbolAddress function. My kernel is slow when I use this link to get constant data, and not to use a global variable. What is the reason for this?

__constant__ int g[2] = {1,2}; // __device__ int g[2] = {1,2}; // kernel: use by reference __global__ void add_1( int *a, int *b, int *c, int *f ) { int tid = blockIdx.x * blockDim.x + threadIdx.x; c[tid] = f[0] * a[tid] + f[1] * b[tid]; } // kernel: use global variable __global__ void add_2( int *a, int *b, int *c, int *f ) { int tid = blockIdx.x * blockDim.x + threadIdx.x; c[tid] = g[0] * a[tid] + f[1] * b[tid]; } int main() { ...... // a,b,c are large arrays in device memory of size 40960. int *f; cudaGetSymbolAddress( (void **)&f, (char *)&g); add_1 <<< 160, 256 >>> ( a, b, c, f ); ...... } 

This is sample code, and all threads in warp load the same place at the same time. Commented code is direct access to read-only memory

Explanation why persistent memory cache is not used (by talonmies)

The reason is the lack of a permanent cache. Cached access only occurs when the compiler emits a specific PTX instruction (ld.const) for a variable that is explicitly marked as being in constant state space. And the way the compiler knows to do this is when the __constant__ variable is __constant__ - this is a static compile-time attribute that affects code generation. The same process cannot happen at runtime.

If you pass a pointer to global memory, and the compiler cannot determine that the pointer is in constant-state space, it will not generate the correct PTX to access that memory through a constant cache. As a result, access will be slower.

Unanswered question

Why, even when the g array is declared as a __device__ variable, the code is slower when using a reference to it. After PTX code to load the global memory into registers:

  • 2 commands ld.global.s32 , which load 4 bytes into the register. (in code using a link)
  • 1 command ld.global.v2.s32 , which loads 8 bytes in 2 registers (in code using a global variable)

What is the difference and any link to the documentation will be appreciated?

+4
source share
1 answer

Unlike global memory, access to read-only memory will be serialized (divided into several transactions) if they are not homogeneous (all threads (half-for compute capacity 1.x) warp get access to the same address.

So use only read-only memory if access is likely to be uniform.

+2
source

Source: https://habr.com/ru/post/1440283/


All Articles