Which is faster in assembly language: adding a variable or adding an immediate value, and why?

For instance:

    ; Method 1
    .data
    val1 DWORD 10000h
    .code
    add eax,val1

vs:

    ; Method 2
    .code
    add eax,10000h

Which method will execute faster once assembled? I think method 2 will produce faster code, because the CPU will not have to read the value from main memory before adding it to the eax register. I am not sure my reasoning is right, though. Can anyone help?

+4
4 answers

10000h will be read from memory either way, whether from its location in data memory or from its location in instruction memory. For small constants, CPUs provide special instructions that need no additional space for the added value, but this depends on the specific architecture. Adding an immediate is likely to be faster thanks to caching: by the time the instruction is decoded, the constant is already in the cache, so the addition will be very fast.

A small off-topic note: your example shows a case where an optimizing C compiler can produce faster code than handwritten assembly. Instead of adding 10000h, the optimizer can increment the upper half-word by one and leave the lower half-word as it is.
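The half-word observation can be checked with a small C sketch. The function names `add_immediate` and `bump_upper_half` are hypothetical, introduced only for illustration. The sketch assumes 32-bit unsigned arithmetic with wraparound, and it says nothing about which form a particular compiler actually emits.

```c
#include <stdint.h>

/* Hypothetical helper: the straightforward 32-bit add of 10000h. */
uint32_t add_immediate(uint32_t x) {
    return x + 0x10000u;
}

/* Hypothetical helper: increment only the upper 16-bit half-word
   and keep the lower half-word unchanged. */
uint32_t bump_upper_half(uint32_t x) {
    uint16_t hi = (uint16_t)(x >> 16);
    uint16_t lo = (uint16_t)(x & 0xFFFFu);
    hi = (uint16_t)(hi + 1u);   /* wraps at FFFFh, matching the 32-bit wraparound */
    return ((uint32_t)hi << 16) | lo;
}
```

Both functions return the same value for every 32-bit input, because adding 10000h never touches the low 16 bits and any carry out of the upper half is discarded in both cases.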

+3

In all likelihood, it will be dependent on the situation, and the difference may not even be noticeable.

Factors such as out-of-order execution are likely to obscure any inherent "slowness" of any version, unless a bottleneck actually exists.

However, if we had to choose which one is faster, then you are right that the second case will most likely be faster.

If we look at Agner Fog's instruction tables for current x86 processors:

Core 2:

    add/sub r, r/i    Latency = 1          1/Throughput = 0.33
    add/sub r, m      Latency = unknown    1/Throughput = 1

Nehalem:

    add/sub r, r/i    Latency = 1          1/Throughput = 0.33
    add/sub r, m      Latency = unknown    1/Throughput = 1

Sandy Bridge:

    add/sub r, r/i    Latency = 1          1/Throughput = 0.33
    add/sub r, m      Latency = unknown    1/Throughput = 0.5

K10:

    add/sub r, r/i    Latency = 1          1/Throughput = 0.33
    add/sub r, m      Latency = unknown    1/Throughput = 0.5

In all cases, the memory-operand version has lower throughput. Its latency is listed as unknown in all cases, but it will almost certainly be more than 1 cycle. So it is worse on every count.

The memory-operand version uses all the same execution ports as the immediate version, plus a memory read port. That can only make things worse. In fact, this is why the throughput is lower with a memory operand: the memory ports can only sustain 1 or 2 reads per cycle, while the adders can sustain a full 3 per cycle.
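To put the port limits into numbers, here is a rough C sketch. The function `min_cycles` and the constant names are hypothetical. It only computes a lower bound from the peak rates quoted above (3 adds per cycle with an immediate, 1 or 2 per cycle with a memory operand), and it assumes fully independent instructions with every load hitting L1. Real code will also be limited by dependencies, decoding and cache misses.

```c
#include <stdint.h>

/* Assumed peak rates, derived from the reciprocal-throughput figures
   quoted above: 0.33 -> 3 per cycle, 1 -> 1 per cycle, 0.5 -> 2 per cycle. */
enum {
    ADDS_PER_CYCLE_IMM       = 3,  /* add r, i on all four listed CPUs      */
    ADDS_PER_CYCLE_MEM_1PORT = 1,  /* add r, m on Core 2 and Nehalem        */
    ADDS_PER_CYCLE_MEM_2PORT = 2   /* add r, m on Sandy Bridge and K10      */
};

/* Hypothetical helper: lower bound on the cycles needed to issue n
   independent adds when at most per_cycle of them can start each cycle. */
uint64_t min_cycles(uint64_t n, uint64_t per_cycle) {
    return (n + per_cycle - 1) / per_cycle;   /* ceiling division */
}
```

Under these assumptions, 300 independent adds need at least `min_cycles(300, ADDS_PER_CYCLE_IMM)` = 100 cycles with an immediate, against 300 cycles with one read port or 150 with two.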

In addition, this assumes that the data is in the L1 cache. If this is not the case, the memory operand version will be MUCH slower.


Taking this a step further, we can examine the size of the encoded instructions:

    add eax,val1      ->    03 05 14 00 00 00
    add eax,10000h    ->    05 00 00 01 00

The encoding of the first may vary slightly depending on the address of val1. The bytes shown here come from my specific test case.

Thus, the memory access version requires an extra byte for encoding - which means a slightly larger code size - and potentially more i-cache misses in the worst case.
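The two encodings can be picked apart with a short C sketch. The byte arrays are copied from the listing above; the function names (`read_le32`, `imm_operand`, `mem_displacement`, `extra_bytes`) are hypothetical and exist only for illustration. The sketch assumes 32-bit mode, where opcode 05 is `add eax, imm32` and opcode 03 with ModRM byte 05 is `add eax, [disp32]`.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte sequences copied from the listing above. */
const uint8_t ADD_EAX_MEM[] = { 0x03, 0x05, 0x14, 0x00, 0x00, 0x00 };
const uint8_t ADD_EAX_IMM[] = { 0x05, 0x00, 0x00, 0x01, 0x00 };

/* Hypothetical helper: read a 32-bit little-endian value from four bytes. */
uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Immediate version: one opcode byte (05), then the 4-byte constant. */
uint32_t imm_operand(void) {
    return read_le32(ADD_EAX_IMM + 1);
}

/* Memory version: opcode (03), ModRM (05), then the 4-byte address of val1. */
uint32_t mem_displacement(void) {
    return read_le32(ADD_EAX_MEM + 2);
}

/* Size difference between the two encodings, in bytes. */
size_t extra_bytes(void) {
    return sizeof ADD_EAX_MEM - sizeof ADD_EAX_IMM;
}
```

The immediate form carries the constant 10000h inside the instruction itself, while the memory form carries only an address (14h in this test case) plus the extra ModRM byte, which accounts for the one-byte difference.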


So, in conclusion, if there is a performance difference between the two versions, the immediate version will most likely be faster, because:

  • It has lower latency.
  • It has higher throughput.
  • It has a shorter encoding.
  • It does not need to access the data cache, which could potentially miss.
+5

Adding an immediate (your magic hex value) is indeed faster (on the architectures I know of, at least).

I think the question is how much. Now I believe that it depends on whether val1 is cached or not.

If it is NOT cached, it is very slow, since main memory access is far slower than cache access (with L1 being the fastest cache level).

If it is cached, the results are, in my opinion, pretty close to each other.

+3

I have not written assembly in a while, but I believe these two snippets are not equivalent.

In method 1 you add the address of val1 to eax, whereas in method 2 you add the constant 10000h to eax. To add the contents of the variable, you would need to write:

 add eax,[val1] 

and that will be slower because it requires a memory read. That code may not even be legal, though. Shouldn't you do something like:

    mov ecx, val1
    add eax, [ecx]

As I said, my Intel assembly is pretty rusty :)

0

Source: https://habr.com/ru/post/1389768/

