OpenGL state redundancy elimination tree, render state priorities

I am working on an automatic OpenGL batching method in my game engine to reduce draw calls and redundant state calls.

My batch tree design starts with the most expensive states and adds leaves further down for each less expensive state.

Example: Tree root: shaders / programs. Siblings: blend states, and so on.

So my question is: which calls in this list are most likely the most expensive?

  • binding a program
  • binding textures
  • binding buffers
  • buffering (uploading) texture and vertex data
  • binding render targets
  • glEnable / glDisable
  • blend state: equation, color, functions, colorWriteMask
  • depth / stencil state: depthFunc, stencil operations, stencil function, write masks

I am also wondering which method will be faster:
- Collect all batchable draw commands into a single vertex buffer and issue only one draw call (this method also forces the matrix transforms to be applied per vertex on the CPU side)
- Do not batch at all and issue many small draw calls, batching only the particle system ...
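
To make the CPU-side cost of the first option concrete, here is a minimal sketch of what pre-transforming and merging vertices could look like. The `Instance` type and `buildBatch` function are hypothetical names, not part of WebGL; the only assumption taken from WebGL is the column-major 4x4 matrix layout.

```typescript
// Hypothetical mesh instance: xyz positions plus a column-major 4x4 model matrix.
interface Instance {
  positions: Float32Array; // [x0, y0, z0, x1, y1, z1, ...]
  model: Float32Array;     // 16 floats, column-major
}

// Pre-transforms every vertex on the CPU and packs all instances into one array,
// which could then be uploaded once and drawn with a single draw call.
function buildBatch(instances: Instance[]): Float32Array {
  let total = 0;
  for (const inst of instances) total += inst.positions.length;

  const out = new Float32Array(total);
  let o = 0;
  for (const { positions: p, model: m } of instances) {
    for (let i = 0; i < p.length; i += 3) {
      const x = p[i], y = p[i + 1], z = p[i + 2];
      // Affine transform: rotation/scale part plus translation (w is assumed to be 1).
      out[o++] = m[0] * x + m[4] * y + m[8] * z + m[12];
      out[o++] = m[1] * x + m[5] * y + m[9] * z + m[13];
      out[o++] = m[2] * x + m[6] * y + m[10] * z + m[14];
    }
  }
  return out;
}
```

The work here grows with the total vertex count every time a transform changes, so whether it beats many small draw calls depends on the scene and has to be measured.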

PS: Render targets will always be changed pre or post, depending on usage.

Progress so far:

  • Andon M. Coleman: uniform and vertex array binding are the cheapest; FBO and texture bindings are expensive.
  • datenwolf: program changes invalidate the state cache

1: Framebuffer states
2: Program
3: texture binding
...
N: vertex array binding, uniform binding
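
One way to apply an ordering like this is to pack state identifiers into a numeric sort key, with the most expensive state in the most significant position, and sort the draw queue by that key. This is a minimal sketch, assuming hypothetical engine-assigned IDs below 256 for each state; `DrawCommand`, `sortKey`, `sortByStateCost`, and `countStateChanges` are illustrative names only.

```typescript
// Hypothetical draw command: every field is an engine-assigned ID in [0, 255].
interface DrawCommand {
  framebuffer: number;
  program: number;
  texture: number;
  vertexArray: number;
}

// The most expensive state goes first, so it lands in the most significant byte.
function sortKey(cmd: DrawCommand): number {
  // Multiplication instead of << keeps the key a non-negative number.
  return ((cmd.framebuffer * 256 + cmd.program) * 256 + cmd.texture) * 256 + cmd.vertexArray;
}

// Returns a sorted copy; the input queue is left untouched.
function sortByStateCost(cmds: DrawCommand[]): DrawCommand[] {
  return cmds.slice().sort((a, b) => sortKey(a) - sortKey(b));
}

// Counts how often each state would have to be set if the commands ran in order.
function countStateChanges(cmds: DrawCommand[]): Record<keyof DrawCommand, number> {
  const changes = { framebuffer: 0, program: 0, texture: 0, vertexArray: 0 };
  let prev: DrawCommand | undefined;
  for (const cmd of cmds) {
    for (const k of Object.keys(changes) as (keyof DrawCommand)[]) {
      if (prev === undefined || prev[k] !== cmd[k]) changes[k]++;
    }
    prev = cmd;
  }
  return changes;
}
```

Comparing `countStateChanges` before and after sorting gives a quick check of whether a given priority order actually reduces the expensive changes for a particular scene.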

The current execution tree in WebGL:

  • Program
  • Attribute Pointers
  • Texture
  • Blend State
  • Depth State
  • Stencil State, Front / Back
  • Rasterizer State
  • Sampler State
  • Bind Buffer
  • Draw Arrays

Each step is a sibling hash tree, which avoids checking against the state cache inside the main render queue.
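
A minimal sketch of such a tree, assuming each level is a hash map keyed by a hypothetical numeric state ID and each draw is inserted with one ID per level. `StateNode`, `insertDraw`, and `flush` are illustrative names, not the engine's actual code.

```typescript
// One node per distinct state value; children are keyed by the next level's state ID.
class StateNode {
  readonly children = new Map<number, StateNode>();
  readonly draws: string[] = []; // hypothetical draw identifiers stored at the leaves
}

// `path` lists one state ID per tree level, e.g. [program, texture, blendState].
function insertDraw(root: StateNode, path: number[], draw: string): void {
  let node = root;
  for (const id of path) {
    let child = node.children.get(id);
    if (child === undefined) {
      child = new StateNode();
      node.children.set(id, child);
    }
    node = child;
  }
  node.draws.push(draw);
}

// Walks the tree depth-first. `apply(level, id)` runs once per node, so no
// per-draw comparison against a state cache is needed while flushing.
function flush(
  root: StateNode,
  apply: (level: number, id: number) => void,
  draw: (d: string) => void,
  level = 0
): void {
  root.draws.forEach(d => draw(d));
  root.children.forEach((child, id) => {
    apply(level, id);
    flush(child, apply, draw, level + 1);
  });
}
```

The sketch assumes every draw supplies a full path of the same length, so each subtree sets all of its own states on the way down.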

Loading of textures / programs / shaders / buffers happens before rendering, in a separate queue, both for future multithreading and to make sure the context is initialized before anything is done with it.

The biggest problem with self-rendering objects is that you cannot control when something happens. For example, if a developer calls these methods before GL is initialized, he will get errors or problems without knowing why ...
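
The separate loading queue described above could look roughly like this. It is a sketch under assumptions: `LoadQueue` is a hypothetical class, and `Ctx` stands in for whatever context type the engine passes around.

```typescript
// Deferred loading queue: tasks can be enqueued at any time, even before the
// context exists, and only run when the engine drains the queue with a context.
class LoadQueue<Ctx> {
  private tasks: Array<(ctx: Ctx) => void> = [];

  // Safe to call before GL is initialized; nothing touches the context yet.
  enqueue(task: (ctx: Ctx) => void): void {
    this.tasks.push(task);
  }

  // Called by the engine before the render queue runs, once a context exists.
  // Returns the number of tasks executed.
  drain(ctx: Ctx): number {
    const tasks = this.tasks;
    this.tasks = []; // tasks enqueued during draining wait for the next drain
    for (const task of tasks) task(ctx);
    return tasks.length;
  }
}
```

Because user code only ever enqueues work, the engine decides when resource creation actually runs, which sidesteps the ordering problem with self-rendering objects.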

2 answers

The relative costs of such operations will, of course, depend on the usage pattern and your overall scenario. But you may find Nvidia's "Beyond Porting" presentation slides a useful guide. Let me reproduce slide 48 in particular here:

The relative cost of state changes

  • In decreasing order of cost ...
  • Render Target ~ 60K / s
  • Program ~ 300K / s
  • ROP
  • Texture Bindings ~ 1.5M / s
  • Vertex format
  • UBO bindings
  • Uniform Updates ~ 10M / s

This does not directly match all the bullet points on your list. For instance, glEnable / glDisable can affect anything. Also, GL buffer bindings are nothing the GPU sees directly; they are mainly client-side state, depending on the target, of course. A change of blending state would be a ROP state change, and so on.
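
To put those rates in per-frame terms, they can be divided by the target frame rate. The sketch below only does that arithmetic; `CHANGES_PER_SECOND` and `perFrameBudget` are hypothetical names, the numbers are the approximate ones from the slide, and the result is an upper bound that assumes the frame does nothing else.

```typescript
// Approximate rates quoted from the slide (state changes per second).
const CHANGES_PER_SECOND: Record<string, number> = {
  "Render target": 60000,
  "Program": 300000,
  "Texture bindings": 1500000,
  "Uniform updates": 10000000,
};

// Upper bound on changes per frame if the frame did nothing but that change.
function perFrameBudget(fps: number): Record<string, number> {
  const budget: Record<string, number> = {};
  for (const name of Object.keys(CHANGES_PER_SECOND)) {
    budget[name] = Math.floor(CHANGES_PER_SECOND[name] / fps);
  }
  return budget;
}
```

At 60 frames per second, for example, ~60K / s works out to about 1,000 render target switches per frame and ~300K / s to about 5,000 program changes.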


This tends to be highly platform / vendor dependent. Any numbers you find apply to a specific GPU, platform, and driver version. And there are many myths about this topic on the Internet. If you really want to know, you need to write some benchmarks and run them on a range of platforms.

With all these caveats:

  • Render target (FBO) switching tends to be quite expensive. It is highly platform and architecture dependent, though. For example, on some form of tile-based architecture, pending rendering that would ideally be deferred until the end of the frame may have to be completed and flushed out. Or on more "classic" architectures, there may be compressed color buffers or buffers used for early depth testing that need consideration when render targets are switched.

  • Updating texture or buffer data cannot be evaluated in general terms. Obviously, it depends heavily on how much data is being updated. Contrary to some claims on the Internet, calls like glBufferSubData() and glTexSubImage2D() do not typically cause a synchronization. But they do involve data copies.

  • Binding programs should not be terribly expensive, but it is typically still more heavyweight than the state changes below.

  • Texture binding is mostly relatively cheap. But it really depends on the circumstances. For example, if you use a GPU that has VRAM, and the texture is not currently in VRAM, binding it may trigger a copy of the texture data from system memory to VRAM.

  • Uniform updates. This is supposedly very fast on some platforms. But it is actually moderately expensive on others. So there is a lot of variability.

  • Vertex state setup (including VBO and VAO binding) is typically fast. It has to be, because most applications do it so frequently that it can very quickly become a bottleneck. But there are similar considerations as for textures, where buffer memory may have to be copied / mapped if it was not used recently.

  • General state updates, such as blend state, stencil state, or write masks, are generally very fast. But there can be very substantial exceptions.

Just a typical example of why the characteristics can be so different between architectures: if you change the blend state, that may mean sending a couple of command words on one architecture, with minimal overhead. On other architectures, blending is done as part of the fragment shader. So if you change the blend state, the shader program has to be modified to patch in the code for the new blending calculation.
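
The benchmarking advice at the start of this answer could be sketched as a small timing harness like the one below. `changesPerSecond` is a hypothetical helper: `change` is whatever state change you want to measure, and `sync` is assumed to block until queued GPU work is done (for example by calling `gl.finish()` in WebGL).

```typescript
// Times `iterations` runs of `change` and returns the measured rate per second.
// `nowMs` is injectable so a higher-resolution clock can be supplied if available.
function changesPerSecond(
  change: (i: number) => void,
  sync: () => void,
  iterations: number,
  nowMs: () => number = Date.now
): number {
  sync(); // start from an idle pipeline
  const start = nowMs();
  for (let i = 0; i < iterations; i++) change(i);
  sync(); // include the cost of work the driver deferred
  const elapsedMs = nowMs() - start;
  return elapsedMs > 0 ? (iterations * 1000) / elapsedMs : Infinity;
}
```

A state change that is never followed by a draw may be skipped by the driver, so `change` should usually include a small draw call, and the result is only meaningful for the GPU, driver, and platform it was measured on.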


Source: https://habr.com/ru/post/1201153/

