Should uniform mat4 matrices (declared at the top of the shader) be pre-multiplied on the CPU?

Consider a typical “naive” vertex shader:

in vec3 aPos;

uniform mat4 uMatCam;
uniform mat4 uMatModelView;
uniform mat4 uMatProj;

void main () {
    gl_Position = uMatProj * uMatCam * uMatModelView * vec4(aPos, 1.0);
}

Of course, the generally accepted wisdom has it that, since three mat4s are multiplied for every vertex, two of which stay uniform even across several subsequent glDrawX() calls in the current shader program, at least those two should be pre-multiplied on the CPU, possibly even all three.
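To make "pre-multiplied on the CPU" concrete, here is a minimal sketch of the all-three variant, assuming GLM for the CPU-side math and a GLEW-style loader; the uniform name uMatMVP and the helper function are invented for illustration. The shader side would then reduce to gl_Position = uMatMVP * vec4(aPos, 1.0);.

// Minimal sketch: collapse the three per-vertex multiplies into one CPU-side
// multiply per upload. Assumes GLM; uMatMVP is a hypothetical uniform name.
#include <GL/glew.h>              // or GLAD / whichever loader is in use
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

void uploadPremultipliedMVP(GLuint program,
                            const glm::mat4& proj,
                            const glm::mat4& cam,
                            const glm::mat4& modelView)
{
    // Three mat4 multiplies done once here instead of once per vertex.
    glm::mat4 mvp = proj * cam * modelView;

    // 'program' must currently be bound via glUseProgram() for this to apply.
    glUniformMatrix4fv(glGetUniformLocation(program, "uMatMVP"),
                       1, GL_FALSE, glm::value_ptr(mvp));
}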

I am wondering whether modern GPUs have this use case so optimized that CPU-side pre-multiplication no longer offers a performance advantage. Of course, a purist could say "it depends on the end user's OpenGL implementation", but for this use case we can safely assume it will be a current NVIDIA or ATI driver providing an OpenGL 4.2 implementation.

From your experience, given that we might "draw" upwards of a million vertices per UseProgram() pass, would pre-multiplying at least the first two (the perspective-projection and camera-transform matrices) once per UseProgram() gain anything significant? What about pre-multiplying all three once per Draw() call?

Of course, it all comes down to benchmarking... but I was hoping someone has fundamental, hardware-based insights I am missing that would suggest either "don't even try, don't waste your time" or "do it by all means, since your current shader without pre-multiplication would be sheer madness"... Thoughts?

1 answer

I am wondering whether modern GPUs have this use case so optimized that CPU-side pre-multiplication no longer offers a performance advantage.

GPUs work best on highly parallel operations. The only way a GPU could optimize three consecutive matrix multiplications like this is for the shader compiler to detect that the operands are uniforms and do the multiplications somewhere when you issue a draw call, passing the result to the shader.

So in either case, the three matrix multiplies become one in the shader. You can either do those multiplications yourself or not, and the driver can either implement this optimization or not. Here is a chart of the possibilities:

               | GPU optimizes | GPU doesn't optimize
  -------------|---------------|----------------------
  You send 3   |    Case A     |       Case B
  matrices     |               |
  -------------|---------------|----------------------
  You multiply |    Case C     |       Case D
  on the CPU   |               |
  -------------|---------------|----------------------

In case A, you get better performance than your code alone would provide. In case B, you do not get that better performance.

Cases C and D both guarantee you the same performance as case A.

The question is not whether drivers will implement this optimization. The question is "what is it worth to you?" If you need that performance, then you need to do it yourself; that is the only way to reliably achieve it. And if you don't care about the performance... what does it matter?

In short, if you care about this optimization, do it yourself.

From your experience, given that we might "draw" upwards of a million vertices per UseProgram() pass, would pre-multiplying at least the first two (the perspective-projection and camera-transform matrices) once per UseProgram() gain anything significant? What about all three per Draw() call?

It might gain something; it might not. It all depends on how much of a bottleneck vertex transformation is in your rendering system. There is no way to know without testing in a real rendering environment.
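To make the "first two per UseProgram()" variant concrete, here is a sketch under the same assumptions as before (GLM, GLEW-style loader); the uniform names uMatViewProj and uMatModel and the Mesh struct are hypothetical. The shader would then compute gl_Position = uMatViewProj * uMatModel * vec4(aPos, 1.0);.

// Sketch: one CPU-side multiply per UseProgram() pass, one uniform upload per
// draw call. Names (uMatViewProj, uMatModel, Mesh) are hypothetical.
#include <vector>
#include <GL/glew.h>              // or GLAD / whichever loader is in use
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

struct Mesh {
    GLuint    vao;
    GLsizei   indexCount;
    glm::mat4 model;              // per-object model matrix
};

void drawScene(GLuint program,
               const glm::mat4& proj,
               const glm::mat4& cam,
               const std::vector<Mesh>& meshes)
{
    glUseProgram(program);

    // Once per UseProgram() pass: combine projection and camera on the CPU.
    glm::mat4 viewProj = proj * cam;
    glUniformMatrix4fv(glGetUniformLocation(program, "uMatViewProj"),
                       1, GL_FALSE, glm::value_ptr(viewProj));

    GLint modelLoc = glGetUniformLocation(program, "uMatModel");
    for (const Mesh& mesh : meshes) {
        // Once per draw call: only the per-object matrix changes.
        glUniformMatrix4fv(modelLoc, 1, GL_FALSE, glm::value_ptr(mesh.model));
        glBindVertexArray(mesh.vao);
        glDrawElements(GL_TRIANGLES, mesh.indexCount, GL_UNSIGNED_INT, nullptr);
    }
}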

Also, combining the projection and camera matrices is not a good idea, since that would mean doing your lighting in world space rather than camera space. It also makes deferred rendering much harder, since you no longer have a pure projection matrix to pull values out of.
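A sketch of the alternative the answer is hinting at, with the projection kept separate so the lighting input stays in camera space; the GLSL source is embedded as a C++ string literal, and the names (vViewPos, uMatModelView, uMatProj) follow the question's naming but are otherwise illustrative.

// Sketch: pre-multiply model and camera on the CPU (uMatModelView), but keep
// the projection matrix separate so the vertex shader can hand camera-space
// positions to the lighting code and deferred passes still see a pure uMatProj.
const char* kVertexShaderSrc = R"GLSL(
    #version 420 core
    in vec3 aPos;
    uniform mat4 uMatModelView;   // model -> camera space, combined on the CPU
    uniform mat4 uMatProj;        // kept pure for lighting / deferred rendering
    out vec3 vViewPos;            // camera-space position for the fragment shader
    void main () {
        vec4 viewPos = uMatModelView * vec4(aPos, 1.0);
        vViewPos     = viewPos.xyz;
        gl_Position  = uMatProj * viewPos;
    }
)GLSL";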


Source: https://habr.com/ru/post/1440298/

