Why is an instanced array slower than glDrawElements?

In my program I want to draw a lot of spheres. First I create the vertices and indices for the sphere, then bind them to a VAO / VBO / IBO. After that, I create 1000 random model matrices. Now I have two ways to draw the mesh.

  • Loop 1000 times over the ModelMatrices list, calling glDrawElements each time. The MVP matrix is computed on the CPU and sent to the shader as a uniform.
  • Bind all the matrices to an additional VBO and pass them to the shader as an "in" variable, then make a single glDrawElementsInstanced call.

In the test program, I draw 1000 spheres (about 20M vertices in total). With the 1st method I get about 27 FPS, and with the 2nd it drops to 19 FPS. In theory, the 2nd method should perform better than the 1st.

Here is the code.

I think the bottleneck is the multiplication in the vertex shader (VP * ModelMatrix), because it has to be done for every vertex of the mesh, times 1000 instances.

What can be improved, and what am I doing wrong?
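To put rough numbers on that suspicion, here is a minimal sketch that counts multiply-adds per frame for both methods. It assumes the shader computes `(VP * ModelMatrix) * position`, which is how GLSL groups `VP * ModelMatrix * position` unless parentheses say otherwise. The names `FrameCost`, `loopCost`, `instancedCost` and `instancedRegroupedCost` are made up for this illustration, and the counts are a simple cost model, not a GPU measurement; a shader compiler may or may not optimize the expression.

```cpp
#include <cstdint>

// Rough multiply-add counts for 4x4 math. This is a cost model for
// illustration only, not a measurement of any real GPU.
constexpr std::uint64_t kMat4TimesMat4 = 64;  // 16 outputs, 4 multiply-adds each
constexpr std::uint64_t kMat4TimesVec4 = 16;  // 4 outputs, 4 multiply-adds each

struct FrameCost {
    std::uint64_t cpu;  // multiply-adds per frame on the CPU
    std::uint64_t gpu;  // multiply-adds per frame in the vertex shader
};

// Method 1: MVP = VP * Model once per sphere on the CPU,
// then one mat4 * vec4 per vertex in the shader.
constexpr FrameCost loopCost(std::uint64_t instances, std::uint64_t vertsPerMesh) {
    return {instances * kMat4TimesMat4,
            instances * vertsPerMesh * kMat4TimesVec4};
}

// Method 2 as assumed above: (VP * Model) * position for every vertex,
// i.e. one mat4 * mat4 plus one mat4 * vec4 per vertex.
constexpr FrameCost instancedCost(std::uint64_t instances, std::uint64_t vertsPerMesh) {
    return {0,
            instances * vertsPerMesh * (kMat4TimesMat4 + kMat4TimesVec4)};
}

// Method 2 regrouped: VP * (Model * position), two mat4 * vec4 per vertex.
constexpr FrameCost instancedRegroupedCost(std::uint64_t instances,
                                           std::uint64_t vertsPerMesh) {
    return {0,
            instances * vertsPerMesh * 2 * kMat4TimesVec4};
}
```

For 1000 spheres of 20,000 vertices each, this model gives 320M shader multiply-adds per frame for the loop, 1,600M for the instanced version as assumed, and 640M if the product is regrouped. Whether that difference shows up in the frame rate is something only profiling can tell.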

+4
2 answers

Instancing is not always a win. It is the kind of optimization that needs to be profiled to see whether it is worth doing.

In general, instancing is a win if you render lots of instances (1000 is a fair number, but not that many; think 10,000) that each contain a modest number of vertices (20,000 is probably too many; think up to 3,000 or so). Also, your per-instance data is needlessly large: you use a matrix when you could easily use a vector and a quaternion.

The purpose of instancing is to reduce CPU overhead, specifically the CPU overhead per draw call and state change. With 20 million vertices, odds are good that the CPU overhead of 1000 draw calls and state changes is not your biggest problem.

+10

Since your spheres are rotation-invariant, you can replace the matrix with a simple vec3 translation (perhaps with w = uniform scale?). I'm not sure this will change anything, though, since you are rarely ALU-bound. But 20M vertices is quite a lot.
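A minimal sketch of that idea, written as plain C++ so the math can be checked on the CPU: each instance carries a single vec4 whose xyz is the sphere center and whose w is a uniform scale. The function name `placeSphereVertex` and this packing are assumptions for illustration; a vertex shader would do the same arithmetic before multiplying by VP.

```cpp
struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };

// Hypothetical per-instance attribute:
//   xyz = sphere center in world space, w = uniform scale (radius).
// A sphere looks the same under any rotation, so no rotation is stored.
inline Vec3 placeSphereVertex(const Vec3& local, const Vec4& instance) {
    return {local.x * instance.w + instance.x,
            local.y * instance.w + instance.y,
            local.z * instance.w + instance.z};
}
```

With this packing, the per-instance data is 16 bytes rather than the 64 bytes of a 4x4 float matrix. For example, the unit-sphere vertex (1, 0, 0) placed with instance (10, 20, 30, 2) ends up at (12, 20, 30).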

1000 draw calls per frame is within the range a PC can handle (it should usually be < 3000), which explains why the simple version is not too slow.

As for the poor instancing performance, I really don't know, but I suspect it is due to your colossal 20k vertices per mesh. Instancing was designed for fairly small meshes, so the GPU may not handle this well. Could you try comparing with smaller meshes (200 vertices) with Vsync? I am curious.

+5

Source: https://habr.com/ru/post/1438596/
