GPUs are built with several types of memory. One type is DDRx RAM, which is available to both the CPU and the GPU. In OpenCL and CUDA this is called "global" memory. For GPUs, data in global memory must be transferred between the GPU and the host. It is usually organized in banks so that access can be pipelined; as a result, random reads/writes to global memory are relatively slow, and the best way to access it is sequentially.
Its size varies from about 1 GB to 6 GB per device.
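As an illustration, here is a minimal CUDA sketch (not from the original answer; the kernel name `scale` is made up). Adjacent threads touch adjacent global-memory addresses, so the hardware can combine the accesses into a few wide, sequential transactions:

```cuda
// Hypothetical kernel: sequential (coalesced) global-memory access.
// Thread i reads in[i] and writes out[i], so neighboring threads in a
// warp access neighboring addresses.
__global__ void scale(const float *in, float *out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;  // one coalesced read, one coalesced write
}
```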
The next type is on-chip GPU memory, shared by all of the threads/warps running on a compute unit/multiprocessor. This is faster than global memory, but not directly accessible from the host. CUDA calls it shared memory; OpenCL calls it local memory. It is the best memory for random access to arrays. There are typically 48 KB available under CUDA and 32 KB under OpenCL.
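A sketch of how random access into this memory is used in practice (a hypothetical CUDA kernel, assuming 256 threads per block): a per-block histogram kept in shared memory, where scattered updates are cheap, then merged into a global result at the end.

```cuda
// Hypothetical kernel: random writes land in shared memory, which is
// fast; the same scattered updates against global memory would be slow.
__global__ void blockHistogram(const unsigned char *in, unsigned int *out, int n)
{
    __shared__ unsigned int bins[256];   // one copy per thread block
    bins[threadIdx.x] = 0;               // assumes blockDim.x == 256
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[in[i]], 1u);     // random access into shared memory
    __syncthreads();

    atomicAdd(&out[threadIdx.x], bins[threadIdx.x]);  // merge into global result
}
```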
The third kind of memory is the GPU registers, called private memory in OpenCL (confusingly, CUDA uses the term "local memory" for per-thread data, which can spill out of registers into slow off-chip memory). Private memory is the fastest, but far scarcer than local/shared memory.
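In CUDA, ordinary automatic variables in a kernel normally live in registers, as in this small sketch (kernel name `axpy` is made up); using too many per thread forces spills to slower memory:

```cuda
// Hypothetical kernel: the scalars 'i' and 'xi' are per-thread and
// normally held in registers (OpenCL "private memory").
__global__ void axpy(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // lives in a register
    if (i < n) {
        float xi = x[i];                            // also in a register
        y[i] = a * xi + y[i];
    }
}
```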
The best strategy for optimizing random memory access is to stage data between global and local/shared memory: a GPU kernel copies the part of global memory it needs into local/shared memory, does its work there, and copies the results back to global memory.
Copy to local, process using local, copy back to global: this pattern is essential to understand and internalize for good GPU programming.
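A hedged sketch of that pattern in CUDA (the kernel name is made up, and it assumes the input length is an exact multiple of the 256-thread block size): each block stages a tile of global memory into shared memory, reverses it there, where random access is cheap, and writes the result back.

```cuda
// Hypothetical kernel: global -> shared, compute in shared, shared -> global.
__global__ void reverseBlocks(const int *in, int *out)
{
    __shared__ int tile[256];              // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];             // 1) copy global -> shared (coalesced)
    __syncthreads();                       //    wait until the whole tile is loaded

    int j = blockDim.x - 1 - threadIdx.x;  // 2) work entirely in shared memory
    out[i] = tile[j];                      // 3) copy shared -> global (coalesced)
}
```

Both the load and the store hit global memory sequentially; only the cheap shared-memory reads are out of order.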