Performance of a modern processor

Question

Running on a modern processor (AMD Phenom II 1090T), how many clock cycles per iteration does the following code take: 3 or 11?

    label:
        mov  (%rsi), %rax
        adc  %rax, (%rdx)
        lea  8(%rdx), %rdx
        lea  8(%rsi), %rsi
        dec  %ecx
        jnz  label

The problem is that when I run many iterations of this code, the measured cost jumps between about 3 and about 11 ticks per iteration from run to run, and I can't tell which figure is the real one.

UPD: According to the instruction latency table (PDF), my piece of code takes at least 10 clock cycles on the AMD K10 microarchitecture. So the seemingly impossible 3 ticks per iteration must be caused by measurement error.

Solved: @Atom pointed out that the clock frequency of modern processors is not constant. After I disabled three options in the BIOS (Core Performance Boost, AMD C1E Support and AMD K8 Cool&Quiet Control), the cost of my "six instructions" stabilized at 3 clock cycles.
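
One way a varying clock could distort such a measurement, sketched under assumptions: the question does not say how the ticks were counted, so suppose they come from a timer that runs at a fixed rate while the core itself may be clocked down. The function name `apparent_ticks` and all frequencies below are hypothetical, chosen only to illustrate the arithmetic.

```python
def apparent_ticks(core_cycles, core_hz, timer_hz):
    """Ticks a fixed-rate timer would report for work costing `core_cycles` core cycles."""
    # elapsed time = core_cycles / core_hz; ticks = elapsed time * timer_hz
    return core_cycles * timer_hz / core_hz


# Hypothetical frequencies for illustration only (not measured values).
TIMER_HZ = 3_200_000_000       # assumed fixed-rate timer
FULL_SPEED_HZ = 3_200_000_000  # core running at the same rate as the timer
THROTTLED_HZ = 800_000_000     # core clocked down by power management

print(apparent_ticks(3, FULL_SPEED_HZ, TIMER_HZ))
print(apparent_ticks(3, THROTTLED_HZ, TIMER_HZ))
```

With these made-up numbers the same 3-cycle loop body shows up as 3 ticks at full speed and 12 ticks when the core runs four times slower, which is the kind of spread that disappears once frequency scaling is switched off.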


performance assembly x86-64 amd-processor

leventov Dec 29 '11 at 10:36


2 answers

Dr. David Levinthal's "Performance Analysis Guide" from Intel covers the answers to questions like this in detail.


Crashworks Dec 29 '11 at 10:41


Mysticial · Accepted Answer · 2011-12-30T06:39:35+0000

I will not try to say with any certainty how many cycles (3 or 10) each iteration takes, but I will explain how it is possible to get 3 cycles per iteration.

(Please note that this is for processors in general, and I do not make references specific to AMD processors.)

Key concepts:

Most modern (non-embedded) processors are superscalar and out-of-order. Not only can they execute several (independent) instructions in parallel, they can also reorder instructions to break dependencies, etc.

Let me break down your example:

    label:
        mov  (%rsi), %rax
        adc  %rax, (%rdx)
        lea  8(%rdx), %rdx
        lea  8(%rsi), %rsi
        dec  %ecx
        jnz  label

The first thing to notice is that the last 3 instructions before the branch are all independent of each other:

        lea  8(%rdx), %rdx
        lea  8(%rsi), %rsi
        dec  %ecx

Thus, the processor can execute all three of them in parallel.

Another thing:

        adc  %rax, (%rdx)
        lea  8(%rdx), %rdx

It looks like there is a dependency on rdx that prevents the two from running in parallel. But this is actually a false dependency, because the second instruction does not depend on the output of the first. Modern processors can rename the rdx register so that these two instructions can be reordered or executed in parallel.

The same applies to the rsi register between:

        mov  (%rsi), %rax
        lea  8(%rsi), %rsi
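
The true/false distinction above can be checked mechanically. Here is a minimal sketch that assumes a simplified register-only view of one iteration: memory operands are ignored, and so is the carry flag that adc reads from the previous iteration. The `LOOP` table and the `classify` helper are illustrative names of mine, not part of any real tool.

```python
# Each entry: (text, registers read, registers written).
# Simplified: memory operands and adc's loop-carried carry input are left out.
LOOP = [
    ("mov (%rsi), %rax",  {"rsi"},        {"rax"}),
    ("adc %rax, (%rdx)",  {"rax", "rdx"}, {"flags"}),
    ("lea 8(%rdx), %rdx", {"rdx"},        {"rdx"}),
    ("lea 8(%rsi), %rsi", {"rsi"},        {"rsi"}),
    ("dec %ecx",          {"ecx"},        {"ecx", "flags"}),
    ("jnz label",         {"flags"},      set()),
]


def classify(instrs):
    """Return (true_deps, false_deps) as lists of (earlier, later, register)."""
    true_deps, false_deps = [], []
    last_writer = {}  # register -> index of its most recent writer
    readers = {}      # register -> indices that read it since that write
    for j, (_, reads, writes) in enumerate(instrs):
        for reg in sorted(reads):
            if reg in last_writer:  # read-after-write: a real data dependency
                true_deps.append((last_writer[reg], j, reg))
        for reg in sorted(writes):
            for i in readers.get(reg, []):  # write-after-read: removable by renaming
                false_deps.append((i, j, reg))
            if reg in last_writer:          # write-after-write: removable by renaming
                false_deps.append((last_writer[reg], j, reg))
        for reg in reads:
            readers.setdefault(reg, []).append(j)
        for reg in writes:
            last_writer[reg] = j
            readers[reg] = []
    return true_deps, false_deps


true_deps, false_deps = classify(LOOP)
for kind, deps in (("true", true_deps), ("false", false_deps)):
    for earlier, later, reg in deps:
        print(f"{kind:5} {reg:5} {LOOP[earlier][0]}  ->  {LOOP[later][0]}")
```

Under this simplified model only two true dependencies remain (mov feeding adc through rax, and dec feeding jnz through the flags). The rdx and rsi pairs come out as false dependencies, which is exactly what renaming removes.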

So in the end, 3 cycles is (potentially) achievable as follows (this is just one of several possible orderings):

    1:  mov  (%rsi), %rax
        lea  8(%rdx), %rdx
        lea  8(%rsi), %rsi
    2:  adc  %rax, (%rdx)
        dec  %ecx
    3:  jnz  label

* Of course, I am oversimplifying here. In reality the latencies are probably longer, and there is overlap between different iterations of the loop.
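
The 3-cycle ordering can be reproduced with a toy scheduler. This is only a sketch under loud assumptions: every instruction takes exactly one cycle, false dependencies have already been removed by renaming, only one iteration is considered, and the issue width of 3 is picked for illustration. It is not a model of any real core, and `schedule` is a hypothetical helper.

```python
# (text, indices of earlier instructions whose results this one needs)
LOOP = [
    ("mov (%rsi), %rax",  []),
    ("adc %rax, (%rdx)",  [0]),  # needs rax from the mov
    ("lea 8(%rdx), %rdx", []),
    ("lea 8(%rsi), %rsi", []),
    ("dec %ecx",          []),
    ("jnz label",         [4]),  # needs the zero flag from dec
]


def schedule(instrs, width=3):
    """Greedy schedule: each cycle, issue up to `width` ready instructions in program order."""
    done = set()  # indices that finished in an earlier cycle
    cycles = []
    while len(done) < len(instrs):
        issued = []
        for idx, (_, deps) in enumerate(instrs):
            if idx in done or len(issued) == width:
                continue
            # Ready only if every producer completed in a previous cycle.
            if all(d in done for d in deps):
                issued.append(idx)
        done.update(issued)
        cycles.append([instrs[i][0] for i in issued])
    return cycles


for number, group in enumerate(schedule(LOOP), start=1):
    print(number, group)
```

With width 3 this yields the same three groups as the ordering above. Widening the machine in this toy model shortens the schedule to the 2-cycle dependency chain, while a width of 1 takes 6 cycles.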

In any case, this may explain how 3 cycles can be obtained. As for why you sometimes get 10 cycles, there could be many reasons for that: a branch misprediction, some random pipeline bubble...
