What is the correct way to calculate FPS if the GPU has a task queue and is asynchronous?

I always assumed that the correct way to calculate FPS was simply to time one iteration of the drawing loop. And most of the Internet seems to be in agreement.

But!

But modern graphics cards behave as asynchronous servers: the drawing loop sends drawing commands for vertices/textures/etc. that are already resident on the GPU. Those calls do not block the calling thread until the GPU has finished the request; they are simply appended to the GPU's task queue. So doesn't the "traditional" (and rather ubiquitous) method simply measure the time it takes to issue the calls?

What prompted me to ask is that I did implement the traditional method, and it gave consistently, absurdly high frame rates, even when what was being rendered clearly made the animation choppy. Re-reading my OpenGL SuperBible led me to glGenQueries, which lets me time sections of the rendering pipeline.

To sum up: is the "traditional" way of calculating FPS completely broken on (even barely) modern graphics cards? If so, why are GPU profiling techniques relatively unknown?

+6
2 answers

Measuring FPS is hard. It is made harder by the fact that different people who want to measure FPS do not necessarily want to measure the same thing. So ask yourself this first: why do you want an FPS number?

Before diving into all the pitfalls and potential solutions, I want to point out that this is by no means a problem specific to "modern graphics cards". If anything, it used to be far worse on SGI-type machines, where rendering actually happened on a graphics subsystem that could be remote from the client (as in physically remote). GL 1.0 was actually defined in terms of client and server.

Anyway. Let's get back to the problem.

FPS, meaning the number of frames rendered per second, tries to convey in a single number a rough idea of your application's performance, in a quantity that can be related directly to things like the screen refresh rate. As a first-order approximation of performance it works reasonably well. It breaks down completely as soon as you want a finer-grained analysis.

The problem is that what matters most for the "feeling of smoothness" of an application is when the image you drew ends up on the screen. The other thing that matters quite a bit is how long it takes between the moment you trigger an action and the moment its effect shows up on screen (the overall latency).

As an application draws a series of frames, it submits them at times s0, s1, s2, s3, ... and they end up on screen at t0, t1, t2, t3, ...

To feel smooth, you need all of the following:

  • tn - sn is not too high (latency)
  • t(n+1) - t(n) is small (ideally under 30 ms)
  • There is also a hard constraint on the simulation delta time, which I will discuss later.

When you measure the CPU time for your rendering, you end up measuring s1 - s0 as an approximation of t1 - t0. As it turns out, this is on average not far from the truth, because the client code is never allowed to get "too far ahead" (this assumes you are rendering frames all the time; see below for the other cases). What actually happens is that GL ends up blocking the CPU (usually at SwapBuffers time) when it tries to get too far ahead. That blocking time is roughly the extra time the GPU spends on a single frame compared to the CPU.
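As a concrete illustration (not from the original answer), here is a minimal sketch of that "traditional" CPU-side measurement: it times s(n) - s(n-1) around submit-and-swap, and it is the implicit blocking inside the swap call that keeps the numbers roughly honest. renderScene() and swapBuffers() are hypothetical placeholders for whatever your frame loop actually calls.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical placeholders for your actual frame code.
void renderScene();   // issues GL draw calls (non-blocking command submission)
void swapBuffers();   // e.g. your platform's swap call; may block if the GPU falls behind

void traditionalFrameLoop(bool& running)
{
    using clock = std::chrono::steady_clock;
    auto previous = clock::now();

    while (running) {
        renderScene();
        swapBuffers();   // the CPU may stall here until the GPU catches up

        auto now = clock::now();
        double frameMs = std::chrono::duration<double, std::milli>(now - previous).count();
        previous = now;

        // s(n) - s(n-1): a CPU-side approximation of t(n) - t(n-1)
        std::printf("frame time: %.2f ms (~%.1f fps)\n", frameMs, 1000.0 / frameMs);
    }
}
```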

If you really want to measure t1 - t0, then, as you mentioned in your own post, timer queries get you closer. But... it is never quite that simple. The first problem is that if you are CPU-bound (meaning your CPU is not fast enough to always keep the GPU busy), then part of the time t1 - t0 is actually GPU idle time, which a query will not capture. The next problem you run into is that, depending on your environment (display compositing setup, vsync), queries may actually only measure the time your application spends rendering to the back buffer, which is not the full display time (since the display has not been updated at that point). It gives you a decent idea of how long the rendering takes, but it will not be exact either. Note, too, that queries are themselves subject to the asynchronicity of the GPU, so if your GPU is idle for part of the frame, the query may miss that part. For example, say your CPU takes a very long time (100 ms) to submit a frame while the GPU renders the whole frame in 10 ms: the query will likely report 10 ms, even though the total frame time was closer to 100 ms...
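To make the query-based approach and its blind spot concrete, here is a hedged sketch (assuming an OpenGL 3.3+ context with timer queries and a loader such as glad already set up; renderScene() and swapBuffers() remain hypothetical placeholders) that records both the GPU-busy time reported by a GL_TIME_ELAPSED query and the wall-clock frame time on the CPU. When you are CPU-bound, the two diverge exactly as described above.

```cpp
#include <chrono>
#include <cstdio>
#include <glad/glad.h>   // assumes a loader and a current GL 3.3+ context

void renderScene();   // hypothetical placeholder for the frame's draw calls
void swapBuffers();   // hypothetical placeholder for the swap

GLuint createTimerQuery()
{
    GLuint query = 0;
    glGenQueries(1, &query);
    return query;
}

void timedFrame(GLuint query)
{
    using clock = std::chrono::steady_clock;
    auto cpuStart = clock::now();

    glBeginQuery(GL_TIME_ELAPSED, query);   // start counting GPU-busy time
    renderScene();
    glEndQuery(GL_TIME_ELAPSED);
    swapBuffers();

    // Reading the result right away forces a CPU/GPU sync; a real renderer would
    // double-buffer queries or poll GL_QUERY_RESULT_AVAILABLE instead.
    GLuint64 gpuNanoseconds = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpuNanoseconds);

    double gpuMs = gpuNanoseconds / 1.0e6;
    double cpuMs = std::chrono::duration<double, std::milli>(clock::now() - cpuStart).count();

    // If you are CPU-bound, gpuMs can be far smaller than cpuMs (e.g. 10 ms vs 100 ms),
    // because the query only counts time the GPU spent executing your commands.
    std::printf("GPU busy: %.2f ms, wall clock: %.2f ms\n", gpuMs, cpuMs);
}
```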

Now, about "event-driven" rendering, as opposed to the continuous rendering I have been discussing so far. FPS does not make much sense for those workloads, because the goal is not to draw as many frames per second as possible. The natural metric for GPU performance there is ms/frame. Even so, that is only a small part of the picture. What really matters is the time from the moment you decided you wanted to update the screen to the moment the update actually appears. Unfortunately, that number is hard to obtain: it starts when you receive the event that triggers the update and ends when the screen is refreshed (which you can really only measure with a camera capturing the screen output...).

The problem is that between those two points you may or may not have overlap between CPU and GPU work (or even some delay between the CPU finishing its command submission and the GPU starting to execute it), and that is entirely up to the implementation to decide. The best you can do is call glFinish at the end of the render so you know for sure the GPU has finished processing the commands you sent, and measure the elapsed time on the CPU. That solution does cost you overall performance on the CPU side, and possibly on the GPU side too if you were going to submit the next event right afterwards...
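Here is a minimal sketch of that glFinish approach, with the caveat the answer mentions: the explicit sync removes CPU/GPU overlap, so the measurement itself perturbs what it measures. The render/swap functions are, again, hypothetical placeholders.

```cpp
#include <chrono>
#include <cstdio>
#include <glad/glad.h>   // assumes a loader and a current GL context

void renderScene();      // hypothetical placeholder
void swapBuffers();      // hypothetical placeholder

double measureFrameWithFinish()
{
    using clock = std::chrono::steady_clock;
    auto start = clock::now();

    renderScene();
    swapBuffers();
    glFinish();          // block until the GPU has processed everything submitted so far

    double ms = std::chrono::duration<double, std::milli>(clock::now() - start).count();
    std::printf("CPU submit + GPU execute: %.2f ms\n", ms);
    return ms;           // note: the forced sync serializes CPU and GPU work,
                         // so this changes the very behaviour being measured
}
```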

Last, the discussion of the "hard constraint on simulation delta time":

A typical animation uses a delta time between frames to advance the animation. The fundamental problem is that, for a fully smooth animation, you really want the delta time you use when submitting your frame at s1 to be t1 - t0 (so that when t1 shows up, the time that actually elapsed since the previous frame really was t1 - t0). The trouble, of course, is that you have no idea what t1 - t0 is at the time you submit s1... so you typically use an approximation. Many just use s1 - s0, but that can break down: SLI-type systems, for example, can have delays in AFR rendering between the various GPUs. You could also try to approximate t1 - t0 (or, more realistically, t0 - t(-1)) via queries. The likely result of getting this wrong is micro-stuttering on SLI systems.

The most robust solution is to say "lock to 30 frames per second and always use 1/30 s". It is also the one that allows the least leeway on content and hardware, since you have to make sure rendering really can be done within those 33 ms... but it is what some console developers choose to do (fixed hardware makes it somewhat easier).
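As a sketch of that "lock to 30 fps, always use 1/30 s" idea (not console-specific, just the general shape): the simulation always advances by a fixed step, and the loop paces itself to the frame period. updateSimulation(), renderScene() and swapBuffers() are hypothetical placeholders; in practice vsync would normally enforce the cadence rather than a sleep.

```cpp
#include <chrono>
#include <thread>

void updateSimulation(double dtSeconds);  // hypothetical placeholder
void renderScene();                       // hypothetical placeholder
void swapBuffers();                       // hypothetical placeholder

void fixedStepLoop(bool& running)
{
    using clock = std::chrono::steady_clock;
    constexpr double kStep = 1.0 / 30.0;  // always advance the simulation by 1/30 s
    const auto framePeriod =
        std::chrono::duration_cast<clock::duration>(std::chrono::duration<double>(kStep));

    auto nextFrame = clock::now();
    while (running) {
        updateSimulation(kStep);          // fixed delta, regardless of measured frame time
        renderScene();
        swapBuffers();                    // vsync would normally enforce the 33 ms cadence

        nextFrame += framePeriod;
        std::this_thread::sleep_until(nextFrame);  // crude pacing for the sketch
    }
}
```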

+12

"And most of the Internet seems to fit." for me itā€™s not entirely correct:

Most publications measure how long LOTS of iterations take and then normalize. That way you can reasonably assume that filling (and draining) the pipeline is only a small part of the total time.
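Read that way, the measurement might look something like this hedged sketch: time a large batch of frames and divide, so the cost of filling and draining the pipeline is amortized. renderScene() and swapBuffers() are hypothetical placeholders.

```cpp
#include <chrono>
#include <cstdio>

void renderScene();   // hypothetical placeholder
void swapBuffers();   // hypothetical placeholder

double averageFps(int frameCount)
{
    using clock = std::chrono::steady_clock;
    auto start = clock::now();

    for (int i = 0; i < frameCount; ++i) {
        renderScene();
        swapBuffers();
    }

    double seconds = std::chrono::duration<double>(clock::now() - start).count();
    double fps = frameCount / seconds;    // pipeline fill/drain amortized over many frames
    std::printf("average over %d frames: %.1f fps\n", frameCount, fps);
    return fps;
}
```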

+1

Source: https://habr.com/ru/post/885684/

