Suppose we have an int *data array, and each thread accesses one element of it. Since the array is shared between all threads, it is stored in global memory.
Let me create a test kernel:
__global__ void test(int *data, int a, int b, int c){ ... }
I know for sure that the data array is in global memory, because I allocated it with cudaMalloc. As for the other variables, I have seen several examples that pass an integer directly to the kernel function without allocating memory for it. In my case, these variables are a, b and c.
If I'm not mistaken, even if we do not call cudaMalloc to allocate 4 bytes for each of the three integers, CUDA will do this for us automatically, so in the end a, b and c will be allocated in global memory.
These variables are only auxiliary: the threads read them and nothing more.
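For reference, this is roughly how I set things up on the host. It is only a minimal sketch: the kernel body (elided above) and the helper name launchTest are placeholders I made up for illustration, and I am assuming the launch uses exactly one thread per element of data, so there is no bounds check.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel body (hypothetical): combines the three by-value
// integers with the element owned by this thread.
__global__ void test(int *data, int a, int b, int c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * a + b - c;
}

// Hypothetical host helper. Only data gets an explicit cudaMalloc;
// a, b and c are passed by value in the launch itself.
// Returns 0 on success, -1 if any CUDA call fails.
int launchTest(int *hostData, int blocks, int threadsPerBlock,
               int a, int b, int c)
{
    int n = blocks * threadsPerBlock;   // one thread per element
    size_t bytes = n * sizeof(int);

    int *d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return -1;
    cudaMemcpy(d_data, hostData, bytes, cudaMemcpyHostToDevice);

    test<<<blocks, threadsPerBlock>>>(d_data, a, b, c);
    cudaError_t err = cudaGetLastError();            // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // run errors

    cudaMemcpy(hostData, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : -1;
}
```

Calling launchTest(hostData, 10, 1024, a, b, c) would correspond to the 10 blocks of 1024 threads I use in the numbers below.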
My question is: would it not be better to move these variables to shared memory?
I assume that if we had, for example, 10 blocks of 1024 threads, we would need 10*3 = 30 reads of 4 bytes each to store the numbers in the shared memory of each block.
Without shared memory, if each thread needs to read all three of these variables once, the total number of global memory reads will be 1024*10*3 = 30720, which is very inefficient.
Now here is the problem: I'm somewhat new to CUDA, and I'm not sure whether it is possible to get a, b and c into the shared memory of each block without having every thread read them from global memory and load them into shared memory. If every thread has to do that, the total number of global memory reads will still be 1024*10*3 = 30720, and not 10*3 = 30.
The following website provides an example:
__global__ void staticReverse(int *d, int n)
{
    __shared__ int s[64];
    int t = threadIdx.x;
    int tr = n-t-1;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[tr];
}
Here, each thread loads a different element into the shared array s. In other words, each thread uses its own index to decide which piece of data it copies into shared memory.
In my case, I want to load only the variables a, b and c into shared memory. They are always the same and never change, so they have nothing to do with the individual threads; they are auxiliary values that every thread uses to run some algorithm.
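To make it concrete, this is the kind of thing I have in mind: a sketch in which only thread 0 of each block copies a, b and c into shared memory, and the other threads wait at __syncthreads() before reading them. The kernel name testShared, the params array and the computation are placeholders of mine, and I am again assuming one thread per element of data, so there is no bounds check. I do not know whether this actually reduces the number of global memory reads, which is really what I am asking.

```cuda
#include <cuda_runtime.h>

// Sketch only: one thread per block stages a, b and c in shared memory,
// then every thread of the block reads them from there.
__global__ void testShared(int *data, int a, int b, int c)
{
    __shared__ int params[3];     // one copy per block

    if (threadIdx.x == 0) {       // a single thread per block does the load
        params[0] = a;
        params[1] = b;
        params[2] = c;
    }
    __syncthreads();              // make the values visible to the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Placeholder computation (hypothetical) using the staged values.
    data[i] = data[i] * params[0] + params[1] - params[2];
}
```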
How do I approach this problem? Is it possible to achieve this with only total_amount_of_blocks*3 global memory reads?