Benchmarking Code - Am I Doing It Right?

I want to benchmark C/C++ code. I want to measure CPU time, wall time, and cycles/byte. I wrote some memory functions, but I have a problem with the cycles/byte numbers.

To get CPU time I use getrusage() with RUSAGE_SELF, for wall time I use clock_gettime() with CLOCK_MONOTONIC, and for cycle counts I use rdtsc.

I process an input buffer of a given size, for example 1024 bytes: char buffer[1024]. Here is what I do:

  • Do a warm-up phase: just call fun2measure(args) 1000 times:

    for (int i = 0; i < 1000; i++) fun2measure(args);

  • Then run the real test, for wall time:

    unsigned long i;
    double timeTaken;
    double timeTotal = 3.0; // run for 3 seconds

    for (timeTaken = (double)0, i = 0; timeTaken <= timeTotal; timeTaken = walltime(1), i++)
        fun2measure(args);

  • And for CPU time (almost the same, but polling cputime(1) instead):

    for (timeTaken = (double)0, i = 0; timeTaken <= timeTotal; timeTaken = cputime(1), i++)
        fun2measure(args);

But when I want to get the CPU cycle count for a function, I use this piece of code, once around the wall-time loop:

    unsigned long s = cyclecount();
    for (timeTaken = (double)0, i = 0; timeTaken <= timeTotal; timeTaken = walltime(1), i++) {
        fun2measure(args);
    }
    unsigned long e = cyclecount();

and once around the CPU-time loop:

    unsigned long s = cyclecount();
    for (timeTaken = (double)0, i = 0; timeTaken <= timeTotal; timeTaken = cputime(1), i++) {
        fun2measure(args);
    }
    unsigned long e = cyclecount();

and then I compute cycles/byte as (e - s) / (i * inputSize); here inputSize is 1024 because that is the length of buffer. But when I raise timeTotal to 10 s, I get strange results:

for 10s:

    Did fun2measure 1148531 times in 10.00 seconds for 1024 bytes, 0 cycles/byte [CPU]
    Did fun2measure 1000221 times in 10.00 seconds for 1024 bytes, 3.000000 cycles/byte [WALL]

for 5s:

    Did fun2measure 578476 times in 5.00 seconds for 1024 bytes, 0 cycles/byte [CPU]
    Did fun2measure 499542 times in 5.00 seconds for 1024 bytes, 7.000000 cycles/byte [WALL]

for 4s:

    Did fun2measure 456828 times in 4.00 seconds for 1024 bytes, 4 cycles/byte [CPU]
    Did fun2measure 396612 times in 4.00 seconds for 1024 bytes, 3.000000 cycles/byte [WALL]

My questions:

  • Are my results correct?
  • Why do I always get 0 cycles/byte for the CPU measurement when I increase the run time?
  • How can I compute statistics such as the mean and standard deviation for this kind of benchmark?
  • Is my benchmarking methodology sound?

CHEERS!

1st EDIT:

After changing i to double :

    Did fun2measure 1138164.00 times in 10.00 seconds for 1024 bytes, 0.410739 cycles/byte [CPU]
    Did fun2measure 999849.00 times in 10.00 seconds for 1024 bytes, 3.382036 cycles/byte [WALL]

my results look ok. So question number 2 is no longer a question :)

1 answer

The benchmark is flawed because it includes the cost of calling the walltime/cputime functions inside the timed loop. In general, I'd urge you to use a proper profiler instead of trying to reinvent the wheel; hardware performance counters in particular will give you numbers you can rely on. Also note that raw cycle counts are very unreliable, since the CPU usually does not run at a fixed frequency, and the kernel may preempt your process and temporarily suspend your application.

Personally, I write benchmarks so that they execute a given function N times, with N large enough to give plenty of samples. Externally, I then use a profiler such as Linux perf to get hard numbers. By repeating the measurement you can compute the average and standard deviation, which you can do in a script that runs the test several times and evaluates the profiler output.


Source: https://habr.com/ru/post/1493485/

