Processor Algorithm / Implementation Benchmarking

Question

Processor Algorithm / Implementation Benchmarking

Let's say I write my own StringBuilder in a compiled language (e.g. C ++).

What is the best way to measure the performance of various implementations? A simple schedule of several hundred thousand launches gives very contradictory results: timings from one batch to another can vary by as much as 15%, which makes it impossible to accurately assess potential productivity improvements that bring profit lower than that.

I have done the following:

Disable SpeedStep
Use RDTSC for synchronization
Run a process with real-time priority
Establish proximity to a single core CPU

This somewhat stabilized the results. Any other ideas?

+3

performance benchmarking cpu

Vladimir Panteleev Dec 25 '11 at 7:50

source share

3 answers

I achieved 100% results this way:

Configure Bochs with MS-DOS.
Configure your binding to the target MS-DOS
- or -
- Configure your 32-bit Windows targeting toolchain
- Install the HX-DOS Extender in Bochs.
- If necessary, hack the standard library / runtime of the tools and run / remove functions that require the use of Windows APIs that are not implemented in HX-DOS. When you try to start the program, the expander will print a list of failed APIs.
Reduce the number of cycles in your test by several orders of magnitude.
Wrap the test code using the cli / sti assembler instructions (note that after this change, the binary will not work on modern operating systems).
If you haven’t done so yet, do your test to use rdtsc deltas to select a time. Samples must be in cli & hellip; sti .
Launch it in Bochs!

The result seems completely deterministic, but not an accurate estimate of overall performance (see discussion in Osman Turan's answer for details).

As a bonus tip, you can easily share files with Bochs here (so you don’t need to disconnect / rebuild / mount the floppy disk image every time):

On Windows, Boch blocks a floppy disk image file, but the file is still open in shared write mode. This means that you cannot overwrite the file, but you can write it. (I think * nix OSes can overwrite to create a new file as far as file descriptors are concerned.) The trick is to use dd . I had the following script installation:

 ... benchmark build commands here ... copy /YC:\Path\To\Benchmark\Project\test2dos.exe floppy\test2.exe bfi -t=288 -f=floppysrc.img floppy dd if=floppysrc.img of=floppy.img

bfi is Bart Build a floppy image .

Then just install floppy.img in Bochs.

Bonus Tip # 2: To avoid having to manually run the test every time in Bochs, put the empty go.txt file in the floppy disk directory and run this batch in Bochs:

 @echo off A: :loop choice /T:y,1 > nul if not exist go.txt goto loop del go.txt echo --------------------------------------------------- test2 goto loop

It runs a test program every time it detects a fresh floppy disk. Thus, you can automate the launch of the test in one script.

Update: this method is not very reliable. Sometimes the timings would change by as much as 200%, simply reordering some tests (these time changes were not observed when working on real hardware, using the methods described in the original question).

+7

Vladimir Panteleev Dec 26 '11 at 13:36

source share

In the past, I was very worried about this problem, and I came to understand that there is only one perfect solution that requires a lot of work (preparation mainly), so I never did it like that.

The solution is to run your code using the 386 emulator, which will tell you how many operations have been completed. You should be able to find the 386 open source emulator. It will be accurate for instructions, and this will require one test run. If you do, write how you did it!

+1

Mike nakis Dec 25 '11 at 9:41

source share

Osman turan · Accepted Answer · 2011-12-25T10:00:57+0000

It is very difficult to accurately measure a piece of code. For these requirements, I recommend that you familiarize yourself with the Agner Fog test suite . Using it, you can measure clock cycles and collect some important factors (such as cache misses, incorrect branch predictions, etc.).

In addition, I recommend that you look at the PDF document from the Agner website. This is a truly invaluable document to enable such micro-optimization.

As a side note, actual performance is not a feature of "clock cycles." Cache misses can change everything for every run in a real application. So, first I would optimize the cache misses. Simple execution of a code fragment several times for the same part of memory significantly reduces cache misses. Therefore, it is difficult to measure accurately. Customizing the entire application is usually better than the IMO idea. Intel VTune and other tools are really good for such usages.

Processor Algorithm / Implementation Benchmarking

More articles: