I wonder if modern GPUs have optimized this use case to the point where CPU-side premultiplication is no longer a performance advantage.
GPUs work best at parallel operations. The only way a GPU could optimize three consecutive vector/matrix multiplications like this is if the shader compiler detects that the matrices are uniforms and performs the multiplication itself somewhere when you issue a draw call, passing the result to the shader.
So either way, the three matrix multiplications become one in the shader. You can either do that multiplication yourself or not. And the driver can either implement this optimization or not. Here's a chart of the possibilities:
|                                  | GPU optimizes | GPU doesn't optimize |
| -------------------------------- | ------------- | -------------------- |
| You don't premultiply on the CPU | Case A        | Case B               |
| You premultiply on the CPU       | Case C        | Case D               |
In case A, the driver gives you better performance than your code alone would. In case B, you don't get that better performance.
Cases C and D both guarantee you the same performance as case A.
The question is not whether drivers will implement this optimization. The question is, "what is that performance worth to you?" If you need it, you have to do the multiplication yourself; that is the only way to reliably get it. And if you don't care about the performance... what does it matter?
In short, if you care about this optimization, do it yourself.
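To make "do it yourself" concrete, here is a minimal sketch of the two options, assuming GLM and a plain modern-OpenGL setup; the uniform names (uProjection, uView, uModel, uMVP) and helper functions are illustrative, not taken from the question:

```cpp
// Sketch only: assumes a compiled/linked 'program', a ready VAO, GLM,
// and a GL function loader (glad, GLEW, ...). Uniform names are illustrative.
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

// Option 1: upload all three matrices; the vertex shader multiplies per vertex:
//   gl_Position = uProjection * uView * uModel * vec4(aPosition, 1.0);
void drawWithThreeUniforms(GLuint program, GLuint vao, GLsizei vertexCount,
                           const glm::mat4& projection,
                           const glm::mat4& view,
                           const glm::mat4& model)
{
    glUseProgram(program);
    glUniformMatrix4fv(glGetUniformLocation(program, "uProjection"), 1, GL_FALSE,
                       glm::value_ptr(projection));
    glUniformMatrix4fv(glGetUniformLocation(program, "uView"), 1, GL_FALSE,
                       glm::value_ptr(view));
    glUniformMatrix4fv(glGetUniformLocation(program, "uModel"), 1, GL_FALSE,
                       glm::value_ptr(model));
    glBindVertexArray(vao);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}

// Option 2: do the multiplication yourself, once per draw call, on the CPU;
// the vertex shader is left with a single mat4 * vec4:
//   gl_Position = uMVP * vec4(aPosition, 1.0);
void drawWithPremultipliedMVP(GLuint program, GLuint vao, GLsizei vertexCount,
                              const glm::mat4& projection,
                              const glm::mat4& view,
                              const glm::mat4& model)
{
    const glm::mat4 mvp = projection * view * model;  // three mat4s folded into one
    glUseProgram(program);
    glUniformMatrix4fv(glGetUniformLocation(program, "uMVP"), 1, GL_FALSE,
                       glm::value_ptr(mvp));
    glBindVertexArray(vao);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}
```

With the second variant the per-vertex work is exactly what the driver would give you in case A, but you no longer depend on the driver to do it.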
In your experience, given that we might be "drawing" a million vertices per UseProgram() pass, would we stand to gain much by premultiplying at least the first two (the perspective projection and the camera transform) once per UseProgram()? What about all three, once per Draw() call?
Maybe; maybe not. It all depends on how much of a bottleneck vertex transformation is in your rendering system. There is no way to know without testing in a real rendering environment.
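One way to do that testing, as a rough sketch (assuming GL 3.3+ timer queries; drawScene() is a placeholder for your own draw calls), is to time the draw on the GPU and compare the premultiplied and non-premultiplied variants:

```cpp
// Rough sketch: measure GPU time for a draw with an OpenGL timer query
// (core since GL 3.3). Assumes a GL function loader is already included.
#include <cstdio>

void drawScene();  // placeholder: your actual rendering, declared elsewhere

void profileDraw()
{
    GLuint query = 0;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    drawScene();                                   // the work being measured
    glEndQuery(GL_TIME_ELAPSED);

    GLuint64 elapsedNs = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);  // waits for the result
    std::printf("GPU time: %.3f ms\n", elapsedNs / 1.0e6);

    glDeleteQueries(1, &query);
}
```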
Also, combining the projection and camera matrices is not a good idea, since it means doing lighting in world space rather than camera space. It also makes deferred rendering considerably more difficult, since you no longer have a clean projection matrix to extract values from.
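If you do premultiply, the usual compromise (sketched below under the same assumptions as above; uModelView and uProjection are illustrative uniform names) is to fold only the model and view matrices together and keep the projection separate, so lighting can stay in camera space and the projection matrix remains available on its own:

```cpp
// Sketch: premultiply only view * model on the CPU and keep the projection
// separate. The vertex shader side would then look like:
//   vec4 eyePos = uModelView * vec4(aPosition, 1.0);  // camera space, usable for lighting
//   gl_Position = uProjection * eyePos;               // clean projection kept on its own
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

void uploadSplitMatrices(GLuint program,
                         const glm::mat4& projection,
                         const glm::mat4& view,
                         const glm::mat4& model)
{
    const glm::mat4 modelView = view * model;  // one multiply per object, on the CPU
    glUniformMatrix4fv(glGetUniformLocation(program, "uModelView"), 1, GL_FALSE,
                       glm::value_ptr(modelView));
    glUniformMatrix4fv(glGetUniformLocation(program, "uProjection"), 1, GL_FALSE,
                       glm::value_ptr(projection));
}
```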