Is it worth passing kernel parameters through shared memory?

Suppose we have an int *data array, and each thread will access one element of it. Since this array is shared between all threads, it is stored in global memory.

Let me create a test kernel:

  __global__ void test(int *data, int a, int b, int c){ ... } 

I know for sure that the data array will be in global memory, because I allocated it with cudaMalloc . As for the other variables, I have seen several examples that pass an integer directly to the kernel function without allocating memory for it. In my case, those variables are a , b and c .
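For concreteness, this is roughly how I set it up and launch it (a sketch of mine, not tested code; the names N and h_data and the launch configuration are made up for illustration):

  int N = 10 * 1024;
  int *h_data = (int *)malloc(N * sizeof(int));   // host copy (contents omitted)
  int *d_data;
  cudaMalloc(&d_data, N * sizeof(int));           // data lives in global memory
  cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

  int a = 1, b = 2, c = 3;                        // plain ints, no cudaMalloc
  test<<<10, 1024>>>(d_data, a, b, c);            // a, b, c passed by value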

If I'm not mistaken, even though we do not call cudaMalloc to allocate 4 bytes for each of the three integers, CUDA will do this for us automatically, so in the end the variables a , b and c will end up in global memory.

These variables are purely auxiliary; the threads only read them and nothing more.

My question is: would it not be better to move these variables into shared memory?

I assume that if we had, for example, 10 blocks of 1024 threads each, we would need 10*3 = 30 reads of 4 bytes to store the numbers in the shared memory of each block.

Without shared memory, if each thread needs to read all three of these variables once, the total number of global memory reads would be 1024*10*3 = 30720 , which is very inefficient.

Here is the problem: I'm fairly new to CUDA, and I'm not sure whether it is possible to move the variables a , b and c into the shared memory of each block without having every thread read them from global memory, so that the total number of global memory reads would be 10*3 = 30 rather than 1024*10*3 = 30720 .

I found the following example on a website:

  __global__ void staticReverse(int *d, int n)
  {
      __shared__ int s[64];
      int t  = threadIdx.x;
      int tr = n - t - 1;   // index of the mirrored element
      s[t] = d[t];          // each thread stages one element in shared memory
      __syncthreads();
      d[t] = s[tr];         // write the block back in reverse order
  }

Here each thread loads a different element into the shared array s : each thread, according to its index, stages its own piece of data in shared memory.

In my case, I want to load only the variables a , b and c into shared memory. These variables are always the same and never change, so they have nothing to do with the threads themselves; they are auxiliary values that every thread uses to run some algorithm.

How do I approach this problem? Is it possible to achieve this with only total_amount_of_blocks*3 global memory reads?
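Something like the following is what I have in mind (a sketch of mine, not tested code; the final line is a placeholder computation):

  __global__ void test(int *data, int a, int b, int c)
  {
      __shared__ int s_a, s_b, s_c;
      if (threadIdx.x == 0) {   // only one thread per block reads the values
          s_a = a;
          s_b = b;
          s_c = c;
      }
      __syncthreads();          // make the values visible to the whole block
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      data[i] = s_a * data[i] + s_b * data[i] + s_c;   // placeholder
  }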

1 answer

The GPU runtime already does this optimally; there is nothing you need to do (and your assumption about how argument passing works in CUDA is wrong). This is what actually happens:

  • On compute capability 1.0 / 1.1 / 1.2 / 1.3 devices, kernel arguments are passed by the runtime in shared memory.
  • On compute capability 2.x and later devices, kernel arguments are passed by the runtime in a dedicated constant memory bank (which has its own cache with broadcast).

So, in your hypothetical kernel

 __global__ void test(int *data, int a, int b, int c){ ... } 

data , a , b and c are all automatically passed by value to each block in shared memory or constant memory (depending on the GPU architecture). There is no advantage in doing what you propose.
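In other words, a kernel can simply read its arguments directly. A minimal sketch (the body is an arbitrary computation for illustration, not from the question):

  __global__ void test(int *data, int a, int b, int c)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      // On modern GPUs a, b and c sit in a constant bank backed by a
      // broadcast-capable cache, so every thread reading them is already
      // as cheap as a shared memory read; no manual staging is needed.
      data[i] = a * data[i] * data[i] + b * data[i] + c;
  }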


Source: https://habr.com/ru/post/1482767/