ARM assembly absolute value function: is the two-line or the three-line version faster?

In my embedded systems class, we were asked to rewrite the C function absval in ARM assembly. We were told that the best we could do was three lines. I was determined to find a two-line solution and eventually did, but my question now is whether I actually decreased performance or increased it.

C code:

    unsigned long absval(signed long x) {
        unsigned long int signext;
        signext = (x >= 0) ? 0 : -1; // This can be done with an ASR instruction
        return (x + signext) ^ signext;
    }

The TA/professor's three-line solution

    ASR R1, R0, #31    ; R1 <- (x >= 0) ? 0 : -1
    ADD R0, R0, R1     ; R0 <- R0 + R1
    EOR R0, R0, R1     ; R0 <- R0 ^ R1

My two-line solution

    ADD R1, R0, R0, ASR #31    ; R1 <- x + (x >= 0) ? 0 : -1
    EOR R0, R1, R0, ASR #31    ; R0 <- R1 ^ (x >= 0) ? 0 : -1
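As a quick sanity check that folding the shift into the second operand does not change the result, both instruction sequences can be modelled in C. This is only a sketch: the function names are made up for illustration, `sign_mask` stands in for `ASR #31` (right-shifting a negative signed value is implementation-defined in C, so the mask is built with a comparison instead), and registers are modelled as `uint32_t` so that the arithmetic wraps the way it does in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for "x ASR #31": all ones if x is negative, else zero. */
uint32_t sign_mask(int32_t x) {
    return (x < 0) ? 0xFFFFFFFFu : 0u;
}

/* Models: ASR R1, R0, #31 ; ADD R0, R0, R1 ; EOR R0, R0, R1 */
uint32_t abs_three_insn(int32_t x) {
    uint32_t r0 = (uint32_t)x;
    uint32_t r1 = sign_mask(x);   /* ASR R1, R0, #31 */
    r0 = r0 + r1;                 /* ADD R0, R0, R1  */
    r0 = r0 ^ r1;                 /* EOR R0, R0, R1  */
    return r0;
}

/* Models: ADD R1, R0, R0, ASR #31 ; EOR R0, R1, R0, ASR #31 */
uint32_t abs_two_insn(int32_t x) {
    uint32_t r0 = (uint32_t)x;
    uint32_t r1 = r0 + sign_mask(x);  /* R0 is still the original x here */
    r0 = r1 ^ sign_mask(x);           /* shift is recomputed from the original x */
    return r0;
}

/* Asserts that both models agree on a handful of edge cases. */
void check_models_agree(void) {
    static const int32_t samples[] = {
        0, 1, -1, 42, -42, INT32_MAX, INT32_MIN + 1, INT32_MIN
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(abs_three_insn(samples[i]) == abs_two_insn(samples[i]));
    }
}
```

The two-line version works because the first instruction writes R1 and leaves R0 untouched, so the second instruction can shift the original x again.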

There are several places where I see potential differences in performance:

  • The addition of one extra arithmetic shift right (ASR) operation
  • The removal of one memory fetch

So which one is faster? Does it depend on the processor or on memory access speed?

2 answers

Head over to ARM.com and grab the Cortex-M3 datasheet. Section 3.3.1 on page 3-4 lists the instruction timings. Fortunately, they are quite simple on the Cortex-M3.

From those timings we can see that in an ideal "no wait state" system, your professor's example takes 3 cycles:

    ASR R1, R0, #31    ; 1 cycle
    ADD R0, R0, R1     ; 1 cycle
    EOR R0, R0, R1     ; 1 cycle
                       ; total: 3 cycles

and your version takes two cycles:

    ADD R1, R0, R0, ASR #31    ; 1 cycle
    EOR R0, R1, R0, ASR #31    ; 1 cycle
                               ; total: 2 cycles

So yours is theoretically faster.

You mentioned "the removal of one memory fetch", but is that true? How big are the respective routines? Since we are dealing with Thumb-2, we have a mix of 16-bit and 32-bit instructions available. Let's see how they assemble:

Their version (adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
    abs:
        asrs r1, r0, #31
        adds r0, r0, r1
        eors r0, r0, r1

Assembled:

    00000000    17c1    asrs r1, r0, #31
    00000002    1840    adds r0, r0, r1
    00000004    4048    eors r0, r1

This is 3x2 = 6 bytes.

Your version (again, adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
    abs:
        add.w r1, r0, r0, asr #31
        eor.w r0, r1, r0, asr #31

Assembled:

    00000000    eb0071e0    add.w r1, r0, r0, asr #31
    00000004    ea8170e0    eor.w r0, r1, r0, asr #31

This is 2x4 = 8 bytes.

So instead of removing a memory fetch, you have actually increased the size of the code.
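Whether the extra two bytes cost an additional fetch depends on how the core reads instruction memory and where the routine is placed. As a rough sketch, assuming instructions are fetched as aligned 32-bit words (an assumption about the memory system, not something taken from the timings above), the number of words a routine touches can be counted like this. The helper name `words_fetched` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of aligned 32-bit words overlapped by a routine that starts at
 * byte address `start` and is `size_bytes` long.
 * ASSUMPTION: instruction memory is fetched one aligned 32-bit word at a time.
 */
uint32_t words_fetched(uint32_t start, uint32_t size_bytes) {
    assert(size_bytes > 0);
    uint32_t first_word = start / 4u;
    uint32_t last_word = (start + size_bytes - 1u) / 4u;
    return last_word - first_word + 1u;
}
```

Under that assumption, `words_fetched(0, 6)` and `words_fetched(0, 8)` both give 2, so at a word-aligned address the 6-byte and 8-byte routines touch the same number of words. At an address that is only halfword-aligned, for example `words_fetched(2, 8)`, the 8-byte routine spans 3 words while the 6-byte one still spans 2.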

But does this affect performance? My advice: benchmark.


Here is another two-instruction version:

    cmp   r0, #0
    rsblt r0, r0, #0

Which translates to this simple code:

    if (r0 < 0) {
        r0 = 0 - r0;
    }

This code should be pretty fast, even on modern ARM CPU cores like the Cortex-A8 and A9.
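For comparison with the branchless versions, the compare-and-conditionally-negate sequence can be modelled in C as well. This is a sketch with made-up function names. The register is modelled as `uint32_t` so that `0 - r0` wraps modulo 2^32 as the RSB instruction does, instead of overflowing a signed type.

```c
#include <assert.h>
#include <stdint.h>

/* Models: CMP r0, #0 ; RSBLT r0, r0, #0 */
uint32_t abs_cmp_rsb(int32_t x) {
    uint32_t r0 = (uint32_t)x;
    if (x < 0) {          /* CMP r0, #0 sets the flags; LT means signed less-than */
        r0 = 0u - r0;     /* RSBLT r0, r0, #0: reverse subtract, r0 = 0 - r0 */
    }
    return r0;
}

/* Asserts a few representative results, including the INT32_MIN edge case. */
void check_abs_cmp_rsb(void) {
    assert(abs_cmp_rsb(0) == 0u);
    assert(abs_cmp_rsb(5) == 5u);
    assert(abs_cmp_rsb(-5) == 5u);
    assert(abs_cmp_rsb(INT32_MIN) == 0x80000000u);
}
```

Like the shift-based versions, this maps the most negative input to 0x80000000, which is the correct magnitude only when the result is read as unsigned.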


Source: https://habr.com/ru/post/944775/

