I won't try to say with confidence how many cycles (3 or 10) each iteration takes, but I will explain how 3 cycles per iteration could be achieved.
(Please note that this applies to processors in general; I am not making claims specific to AMD processors.)
Key concepts:
Most modern (non-embedded) processors today are superscalar and out-of-order. Not only can they execute several (independent) instructions in parallel, they can also reorder instructions, break false dependencies, etc.
Let me break down your example:

    label:
        mov (%rsi), %rax
        adc %rax, (%rdx)
        lea 8(%rdx), %rdx
        lea 8(%rsi), %rsi
        dec %ecx
        jnz label

The first thing to notice is that the last 3 instructions before the branch are all independent of one another:

    lea 8(%rdx), %rdx
    lea 8(%rsi), %rsi
    dec %ecx

Thus, the processor can execute all three of them in parallel.
Another thing:

    adc %rax, (%rdx)
    lea 8(%rdx), %rdx

At first glance there seems to be a dependency on rdx that prevents these two from running in parallel. But it is actually a false dependency (write-after-read): the second instruction does not depend on anything the first one produces. Modern processors can rename the rdx register so that these two instructions can be reordered or executed in parallel.
The same applies to the rsi register between:

    mov (%rsi), %rax
    lea 8(%rsi), %rsi

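To make the renaming idea concrete, here is a toy Python sketch. It is not how any particular CPU implements renaming; the `rename` function, the `p1`-style physical register names, and the `_0` suffix for "value from before this snippet" are all made up for illustration, and flags and memory are ignored.

```python
def rename(instrs):
    """Toy register renaming: every write gets a fresh physical register."""
    mapping = {}   # architectural register -> physical register holding its newest value
    counter = 0
    out = []
    for text, reads, writes in instrs:
        # Sources are looked up BEFORE destinations are remapped,
        # so an instruction always reads the old value of a register it also writes.
        srcs = {r: mapping.get(r, r + "_0") for r in sorted(reads)}
        dsts = {}
        for r in sorted(writes):
            counter += 1
            mapping[r] = f"p{counter}"
            dsts[r] = mapping[r]
        out.append((text, srcs, dsts))
    return out

# (text, registers read, registers written); flags and memory left out for brevity
PAIR = [
    ("adc %rax, (%rdx)",  {"rax", "rdx"}, set()),
    ("lea 8(%rdx), %rdx", {"rdx"},        {"rdx"}),
]

for text, srcs, dsts in rename(PAIR):
    print(f"{text:20} reads {srcs} writes {dsts}")
```

Both instructions read the same old value of rdx, while the lea writes its result into a different physical register, so the lea no longer has to wait for the adc to finish reading.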
As a result, 3 cycles are (potentially) achievable as follows (this is just one of several possible orderings):

    1: mov (%rsi), %rax
       lea 8(%rdx), %rdx
       lea 8(%rsi), %rsi
    2: adc %rax, (%rdx)
       dec %ecx
    3: jnz label

* Of course, I am simplifying here. In reality the latency of each instruction is probably longer, and different iterations of the loop overlap.
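A 3-cycle ordering like this can be reproduced with a small Python model. This is only a sketch under loud assumptions that are mine, not facts about any real processor: every instruction has a latency of 1 cycle, at most 3 instructions issue per cycle, renaming removes all false dependencies, and memory aliasing between the load and the store is ignored. The names `LOOP` and `schedule` are invented for the example.

```python
from collections import defaultdict

# (text, registers read, registers written); flags are modeled as "cf" and "zf".
# Note that dec writes ZF but leaves CF alone, so the carry chain of adc is preserved.
LOOP = [
    ("mov (%rsi), %rax",  {"rsi"},              {"rax"}),
    ("adc %rax, (%rdx)",  {"rax", "rdx", "cf"}, {"cf", "zf"}),
    ("lea 8(%rdx), %rdx", {"rdx"},              {"rdx"}),
    ("lea 8(%rsi), %rsi", {"rsi"},              {"rsi"}),
    ("dec %ecx",          {"rcx"},              {"rcx", "zf"}),
    ("jnz label",         {"zf"},               set()),
]

def schedule(instrs, width=3, latency=1):
    """Greedy scheduling in program order, honoring only true (RAW) dependencies."""
    ready = {}                # register -> first cycle in which its newest value is usable
    used = defaultdict(int)   # cycle -> number of instructions already issued in it
    result = []
    for text, reads, writes in instrs:
        # Values produced before the loop body are assumed usable in cycle 1.
        cycle = max((ready[r] for r in reads if r in ready), default=1)
        while used[cycle] >= width:   # issue width exhausted, try the next cycle
            cycle += 1
        used[cycle] += 1
        for r in writes:
            ready[r] = cycle + latency
        result.append((cycle, text))
    return result

for cycle, text in schedule(LOOP):
    print(cycle, text)
```

Under these assumptions the model places mov and both lea instructions in cycle 1, adc and dec in cycle 2, and jnz in cycle 3. Changing `width` or `latency` changes the result, which is exactly why the real number depends on the specific microarchitecture.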
In any case, this may explain how 3 cycles can be obtained. As for why you sometimes get 10 cycles, there can be many reasons: a branch misprediction, some random pipeline bubble ...