How to correctly allocate an array of global memory using the uint4 vector in CUDA to increase memory bandwidth?

Question

How to correctly allocate an array of global memory using the uint4 vector in CUDA to increase memory bandwidth?

Typically, there are two ways to increase the memory bandwidth of the global memory on the CUDA core at a processing power of 1.3 GPUs; memory gains access to coalescence and access to words of at least 4 bytes. With the first method of accessing the same memory segment, threads of the same half-pattern are combined with fewer transactions when accessing words of at least 4 bytes, this memory segment is effectively increased from 32 to 128.

Update: solution based on talonmies answers . To access 16 bytes instead of 1-byte words when unsigned characters are stored in global memory, the uint4 vector is usually used by casting a memory array in uint4. To get the values from the uint4 vector, it can be converted to uchar4, as shown below:

#include <cuda.h> #include <stdio.h> #include <stdlib.h> __global__ void kernel ( unsigned char *d_text, unsigned char *d_out ) { int idx = blockIdx.x * blockDim.x + threadIdx.x; extern __shared__ unsigned char s_array[]; uint4 *uint4_text = reinterpret_cast<uint4 *>(d_text); uint4 uint4_var; //memory transaction uint4_var = uint4_text[0]; //recast data to uchar4 uchar4 c0 = *reinterpret_cast<uchar4 *>(&uint4_var.x); uchar4 c4 = *reinterpret_cast<uchar4 *>(&uint4_var.y); uchar4 c8 = *reinterpret_cast<uchar4 *>(&uint4_var.z); uchar4 c12 = *reinterpret_cast<uchar4 *>(&uint4_var.w); d_out[idx] = c0.y; } int main ( void ) { unsigned char *d_text, *d_out; unsigned char *h_out = ( unsigned char * ) malloc ( 16 * sizeof ( unsigned char ) ); unsigned char *h_text = ( unsigned char * ) malloc ( 16 * sizeof ( unsigned char ) ); int i; for ( i = 0; i < 16; i++ ) h_text[i] = 65 + i; cudaMalloc ( ( void** ) &d_text, 16 * sizeof ( unsigned char ) ); cudaMalloc ( ( void** ) &d_out, 16 * sizeof ( unsigned char ) ); cudaMemcpy ( d_text, h_text, 16 * sizeof ( unsigned char ), cudaMemcpyHostToDevice ); kernel<<<1,16>>>(d_text, d_out ); cudaMemcpy ( h_out, d_out, 16 * sizeof ( unsigned char ), cudaMemcpyDeviceToHost ); for ( i = 0; i < 16; i++ ) printf("%c\n", h_out[i]); return 0; }

+4

optimization c cuda

charis Oct 27 '12 at 16:58

source share

2 answers

Char * listing will work fine. You tried? If so, what happened, what caused this question?

In your example, it looks like you can just pass s_array to int* and make one copy from var.x (multiplying j by 4 instead of 16).

If you need more flexible byte shuffling in a word, you can use the built-in __byte_perm() . For example, to reverse the byte order of an integer x , you can do __byte_perm(x, 0, 0x0123);

You may not get anything using vector types or even a single int to store bytes. In Fermi, global memory transactions are 128 bytes. So, when your warp hits an instruction that loads / stores from global memory, the GPU will execute as many 128-byte transactions as needed to serve 32 threads. Performance will largely depend on how many separate transactions are needed, and not how each thread loads or stores its bytes.

+3

Roger dahl Oct 27 '12 at 18:43

source share

talonmies · Accepted Answer · 2012-10-28T12:27:06+0000

If I understand what you're trying to do, the logical approach is to use the C ++ reinterpret_cast mechanism to force the compiler to generate the correct vector load instruction, and then use the CUDA built into the uchar4 byte size vector type to access each byte in each of the four 32-bit words loaded from global memory. Using this approach, you really trust the compiler, knowing the best way to use byte access in each 32-bit register.

A completely contrived example might look like this:

 #include <cstdio> #include <cstdlib> __global__ void kernel(unsigned int *in, unsigned char* out) { int tid = threadIdx.x; uint4* p = reinterpret_cast<uint4*>(in); uint4 i4 = p[tid]; // vector load here uchar4 c0 = *reinterpret_cast<uchar4 *>(&i4.x); uchar4 c4 = *reinterpret_cast<uchar4 *>(&i4.y); uchar4 c8 = *reinterpret_cast<uchar4 *>(&i4.z); uchar4 c12 = *reinterpret_cast<uchar4 *>(&i4.w); out[tid*4+0] = c0.x; out[tid*4+1] = c4.y; out[tid*4+2] = c8.z; out[tid*4+3] = c12.w; } int main(void) { unsigned int c[8] = { 2021161062, 2021158776, 2020964472, 1920497784, 2021161058, 2021161336, 2020898936, 1702393976 }; unsigned int * _c; cudaMalloc((void **)&_c, sizeof(int)*size_t(8)); cudaMemcpy(_c, c, sizeof(int)*size_t(8), cudaMemcpyHostToDevice); unsigned char * _m; cudaMalloc((void **)&_m, sizeof(unsigned char)*size_t(8)); kernel<<<1,2>>>(_c, _m); unsigned char m[8]; cudaMemcpy(m, _m, sizeof(unsigned char)*size_t(8), cudaMemcpyDeviceToHost); for(int i=0; i<8; i++) fprintf(stdout, "%d %c\n", i, m[i]); return 0; }

which should create a readable string of characters embedded in an array of unsigned integers supplied to the kernel.

One caveat is that the open64 compiler used for computing 1.x often defeats this strategy by trying to generate vector loads if it can detect that not all words in the vector were actually used. So make sure you touch all input words in the input vector type to make sure the compiler plays well.

How to correctly allocate an array of global memory using the uint4 vector in CUDA to increase memory bandwidth?

More articles: