How to measure performance regardless of the machine used

I had a routine that worked well, but I had to make changes to it. The change improved the routine's accuracy but hurt its performance.

The subroutine does a lot of mathematical calculations and is probably CPU-bound (I still have to run more rigorous tests on this, but I'm 99% sure). It is written in C++ (compiler: Borland C++ 6).

I want to measure the performance of the routine now. At first I thought about measuring the runtime, but that seems like a flawed approach to me, since many other things can influence it.

Then I came across this topic: Methods for measuring application performance on Stack Overflow. I liked the idea of measuring in MFLOPS.

My boss suggested trying some kind of measurement in processor cycles, so that the tests would be machine-independent; however, I think that approach ends up amounting to MFLOPS testing anyway.

In my opinion, measuring both (runtime and MFLOPS) is the way to go, but I would like to hear what the Stack Overflow experts think.
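To make that concrete, here is a minimal sketch of what I have in mind, using plain <ctime> so it should build on an old compiler. my_routine is a placeholder stub, and the MFLOPS figure is only as good as the estimate of how many floating-point operations the real routine performs:

    #include <cstdio>
    #include <ctime>

    // Placeholder for the real subroutine; this stub just burns some FP work
    // (one multiply and one add per iteration, so ~2e6 FLOPs per call).
    void my_routine()
    {
        volatile double x = 0.0;
        for (long i = 0; i < 1000000L; ++i)
            x += i * 0.5;
    }

    // Estimated floating-point operations per call; this has to come from
    // analysing the algorithm, the timing code cannot know it by itself.
    const double FLOP_PER_CALL = 2.0e6;

    int main()
    {
        const int repeats = 100;   // repeat to get a measurable interval
        std::clock_t start = std::clock();
        for (int i = 0; i < repeats; ++i)
            my_routine();
        // std::clock() gives CPU time on POSIX and elapsed time on Windows;
        // for a single-threaded CPU-bound routine the two are close.
        double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;

        double mflops = (FLOP_PER_CALL * repeats) / (seconds * 1.0e6);
        std::printf("total %.3f s, %.6f s/call, ~%.1f MFLOPS\n",
                    seconds, seconds / repeats, mflops);
        return 0;
    }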

How can I measure the performance of a CPU-bound subroutine?

+4
6 answers

Processor clock cycles are also not that meaningful if your application is memory-bound. On a faster processor you will simply spend more cycles waiting on the same cache miss. (Math applications are probably not I/O-bound.)

Another problem is that the number of clock cycles for a given sequence of instructions still varies between architectures (and that even applies to Intel Core 1 vs. Core 2). So as an absolute measure of performance, clock cycles on a single CPU are not much of an improvement.

I would say they are actually a worse measure. Unlike time, cycles are not what users care about. This matters especially on modern multi-core processors: an "inefficient" algorithm that uses twice as many cycles but runs on 3 cores will finish in 67% of the time. Users will probably prefer that.

+5

Your question implies that the software is already as fast as it can be, accuracy aside. I find that is rarely the case, and I assume what you really want is to make it as fast as possible.

I would suggest that measurement is not what you need.

What you really need to do is find the instructions (not functions) that 1) are responsible for a significant fraction of wall-clock time, and 2) that you can find a way to optimize.

Assuming the software is non-trivial in size, it probably has at least several levels of function calls, and chances are that some of those function calls (not the functions themselves, the call sites) are responsible for a significant fraction of the time and can be optimized.

This is a very good way to find them, and this is an example of its use.

+3

I agree with your boss: measure in terms of processor clock cycles. Keep in mind that there may be other things slowing your code down, such as excessive cache misses. If you can, use VTune or one of Intel's free tools to determine the nature of the bottleneck.
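For a rough cycle count without a profiler, the x86 time-stamp counter can be read directly. A minimal sketch using the __rdtsc() intrinsic as GCC/Clang expose it (Borland would need the equivalent inline assembly, and on modern CPUs the TSC ticks at a fixed reference rate, so this is closer to elapsed time than to "work done"):

    #include <cstdio>
    #include <x86intrin.h>   // __rdtsc() on GCC/Clang; other compilers differ

    // Placeholder for the real subroutine.
    void my_routine()
    {
        volatile double x = 0.0;
        for (long i = 0; i < 1000000L; ++i)
            x += i * 0.5;
    }

    int main()
    {
        unsigned long long start = __rdtsc();
        my_routine();
        unsigned long long cycles = __rdtsc() - start;
        std::printf("~%llu reference cycles\n", cycles);
        return 0;
    }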

+2

CPU clock cycles are not machine-independent these days, even between processors that use the same instruction set. The x86 (or whatever) machine code gets sliced and diced in different ways internally. The days when a clock cycle meant something are long gone (and back when processor cycles did mean something, there were so many different kinds of processors that it was machine-dependent anyway).

Not to mention that "CPU-bound" is not as clear-cut as it used to be, what with cache misses and all. It used to be that a process was either I/O-bound or CPU-bound, since accessing memory took a fixed number of processor cycles.

What you are trying to measure is performance in the sense of speed. In that case, you are probably best off using wall-clock time, repeating the calculation enough times to get significant results. You can create a test harness that runs through the various implementations, so that you get comparable results.
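A sketch of such a harness, assuming the competing implementations share a signature (the routine names here are placeholder stubs, and std::clock() is used as a stand-in timer; a higher-resolution wall clock such as QueryPerformanceCounter on Windows would be better if the per-call times are very short):

    #include <cstdio>
    #include <ctime>

    // Placeholders for the competing implementations of the same calculation.
    void routine_v1() { volatile double x = 0; for (long i = 0; i < 1000000L; ++i) x += i * 0.5; }
    void routine_v2() { volatile double x = 0; for (long i = 0; i < 2000000L; ++i) x += i * 0.25; }

    // Average seconds per call over enough repeats to swamp timer resolution.
    double seconds_per_call(void (*fn)(), int repeats)
    {
        std::clock_t start = std::clock();
        for (int i = 0; i < repeats; ++i)
            fn();
        return double(std::clock() - start) / CLOCKS_PER_SEC / repeats;
    }

    int main()
    {
        const int repeats = 200;
        std::printf("v1: %.6f s/call\n", seconds_per_call(routine_v1, repeats));
        std::printf("v2: %.6f s/call\n", seconds_per_call(routine_v2, repeats));
        return 0;
    }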

+2

You can measure in terms of processor hardware counters; the Intel VTune profiler is pretty good at that. It will show you detailed information based on CPU counters (Instructions Retired, Cache Misses, Branch Mispredictions), and it will also correlate this with each statement in your functions, so you will have a pretty good idea of what costs the most.

This assumes that your function is not memory-bound.

thanks

+1

Measure execution time.

In this case, I think you want to minimize what you measure to reduce the number of variables.

Beyond that, it would be a good idea to run some baseline routine to calibrate the particular machine. Use either the last known-good version of your routine or some compute-intensive procedure that roughly matches the type of calculations you are trying to measure. Then you can express the metric as

relative_time = measured_time_for_routine / measured_time_for_baseline 
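A small sketch of that calibration (the two routines are placeholder stubs standing in for the reference work and the subroutine being evaluated):

    #include <cstdio>
    #include <ctime>

    // Placeholder stubs: swap in the real baseline and the real subroutine.
    void baseline_routine()   { volatile double x = 0; for (long i = 0; i < 1000000L; ++i) x += i * 0.5; }
    void routine_under_test() { volatile double x = 0; for (long i = 0; i < 3000000L; ++i) x += i * 0.5; }

    // Total seconds for a fixed number of repetitions of fn.
    double seconds_for(void (*fn)(), int repeats)
    {
        std::clock_t start = std::clock();
        for (int i = 0; i < repeats; ++i)
            fn();
        return double(std::clock() - start) / CLOCKS_PER_SEC;
    }

    int main()
    {
        double baseline = seconds_for(baseline_routine, 100);
        double measured = seconds_for(routine_under_test, 100);
        std::printf("relative_time = %.3f\n", measured / baseline);
        return 0;
    }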
0

Source: https://habr.com/ru/post/1286137/

