Efficient access to arbitrary GPU memory with OpenGL

What is the best pattern for getting the GPU to efficiently compute "anti-functional" routines that depend on positioned memory writes rather than reads? For example: calculating a histogram, sorting, dividing numbers into percentiles, merging data of differing sizes into lists, and so on.

2 answers

The established terms are gather reads and scatter writes.

Gather reads

This means that your program writes to a fixed position (for example, the target position of a fragment in a fragment shader), but has fast access to arbitrary data sources (textures, uniforms, etc.).

Scatter writes

This means that the program receives a stream of input data which it cannot address arbitrarily, but can perform fast writes to arbitrary memory locations.

Obviously, the OpenGL shader architecture is a gather system. Recent OpenGL-4 also allows you to do some scatter writes in the fragment shader, but they are slow.
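To illustrate what such a fragment-shader scatter looks like, here is a minimal sketch using image load/store (core since OpenGL 4.2). The image binding, the `uTarget` name, and the addressing scheme are my assumptions, not part of the answer:

```glsl
#version 420
// Hypothetical sketch: scatter from a fragment shader via image
// load/store. uTarget is a 256x1 r32ui image bound to unit 0.
layout(binding = 0, r32ui) uniform uimage2D uTarget;

void main() {
    // Derive an arbitrary destination cell from this fragment's data...
    ivec2 dst = ivec2(int(gl_FragCoord.x) % 256, 0);
    // ...and write there. An atomic is needed because several
    // fragments may scatter into the same cell.
    imageAtomicAdd(uTarget, dst, 1u);
}
```

This works, but as the answer notes, such unordered image writes are typically much slower than the fixed-position framebuffer path.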

So, what is the most efficient way these days to emulate scatter with OpenGL? Currently it is a vertex shader operating on point primitives. You send as many points as you have data elements to process and scatter them into the target memory by setting their positions accordingly. You can use geometry and tessellation shaders to generate the points processed in the vertex stage. You can use texture buffers and UBOs for the input data, using the vertex/point index for addressing.
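The point-scatter pattern described above can be sketched as a vertex shader for a 256-bin histogram. The `uData` texture buffer, the bin count, and the 256x1 render target are assumptions chosen for illustration; one point is drawn per input element:

```glsl
#version 330
// Sketch: scatter via point rendering. Each point's index
// (gl_VertexID) addresses the input; its position is computed
// from the fetched value, scattering it to the right bin.
uniform samplerBuffer uData;   // input values in [0, 1]

void main() {
    // Gather: fetch this point's input value by its index.
    float value = texelFetch(uData, gl_VertexID).r;
    // Scatter: map the value to its bin's x position in a
    // 256x1 framebuffer, expressed in normalized device coordinates.
    float bin = floor(clamp(value, 0.0, 1.0) * 255.0);
    float x = (bin + 0.5) / 256.0 * 2.0 - 1.0;
    gl_Position = vec4(x, 0.0, 0.0, 1.0);
}
```

With a trivial fragment shader that outputs 1.0 and additive blending enabled (`glBlendFunc(GL_ONE, GL_ONE)`), each bin of the target accumulates the count of values that fell into it.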


GPUs are built with several types of memory. One type is DDRx RAM, accessible to both the CPU and the GPU. In OpenCL and CUDA this is called "global" memory. For discrete GPUs, data in global memory must be transferred between the GPU and the host. It is usually organized in banks to allow pipelined access, so random reads/writes to global memory are relatively slow. The best way to access global memory is sequentially.
Its size ranges from about 1 GB to 6 GB per device.

The next type is on-chip memory shared by the threads/warps running on a compute unit / multiprocessor. It is faster than global memory, but not directly accessible from the host. CUDA calls this shared memory; OpenCL calls it local memory. It is the best memory for random access to arrays. Typical sizes are 48 KB on CUDA devices and 32 KB on OpenCL devices.

The third kind of memory is the GPU registers, called private memory in OpenCL or local memory in CUDA. Private memory is the fastest, but has less capacity than local/shared memory.

The best strategy for optimizing random memory access is to stage data between global and local/shared memory. That is, the GPU kernel copies a portion of global memory into local/shared memory, does its work there, and copies the results back to global memory.

This copy-in, process-locally, copy-back pattern is an important template to understand in order to learn GPU programming well.
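Since the question is about OpenGL, the same pattern can be sketched as an OpenGL compute shader (GLSL `shared` variables are the equivalent of CUDA shared / OpenCL local memory). The buffer bindings, the 256-element tile size, and the within-tile reversal are assumptions; the reversal is chosen only to force random shared-memory access:

```glsl
#version 430
// Sketch of the global -> shared -> global staging pattern.
layout(local_size_x = 256) in;

layout(std430, binding = 0) buffer Src { float src[]; };
layout(std430, binding = 1) buffer Dst { float dst[]; };

shared float tile[256];   // fast on-chip workgroup memory

void main() {
    uint gid = gl_GlobalInvocationID.x;
    uint lid = gl_LocalInvocationID.x;

    // 1. Coalesced copy from slow global memory into shared memory.
    tile[lid] = src[gid];
    barrier();

    // 2. Work on the shared copy; random access is cheap here.
    //    (Here: reverse the elements within the tile.)
    float v = tile[255u - lid];
    barrier();

    // 3. Sequential write of the results back to global memory.
    dst[gid] = v;
}
```

Note that compute shaders require OpenGL 4.3; on older GL versions the vertex-shader scatter from the first answer remains the practical option.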


Source: https://habr.com/ru/post/909326/

