ARM Assembly: absolute value function: are two or three lines faster?

Question

In my embedded systems class, we were asked to rewrite the C function AbsVal below in ARM assembly. We were told that the best we could do was 3 lines. I was determined to find a 2-line solution and eventually did, but my question now is whether I actually improved performance or made it worse.

C code:

    unsigned long absval(signed long x){
        unsigned long int signext;
        signext = (x >= 0) ? 0 : -1; // This can be done with an ASR instruction
        return (x + signext) ^ signext;
    }

The TA/professor's 3-line solution

    ASR R1, R0, #31         ; R1 <- (x >= 0) ? 0 : -1
    ADD R0, R0, R1          ; R0 <- R0 + R1
    EOR R0, R0, R1          ; R0 <- R0 ^ R1

My two-line solution

    ADD R1, R0, R0, ASR #31 ; R1 <- x + (x >= 0) ? 0 : -1
    EOR R0, R1, R0, ASR #31 ; R0 <- R1 ^ (x >= 0) ? 0 : -1
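As a quick sanity check, both sequences can be modelled in portable C. This is only a sketch: the names `sign_mask`, `abs_three`, `abs_two` and `check_abs_models` are made up for illustration, `uint32_t` stands in for a 32-bit ARM register, and the ASR is modelled with a comparison because right-shifting a negative signed value is implementation-defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models "ASR Rd, Rm, #31": 0 for non-negative x, 0xFFFFFFFF for negative x. */
static uint32_t sign_mask(int32_t x)
{
    return (x < 0) ? UINT32_MAX : 0u;
}

/* Three-instruction version: ASR, ADD, EOR. */
static uint32_t abs_three(int32_t x)
{
    uint32_t r0 = (uint32_t)x;
    uint32_t r1 = sign_mask(x);        /* ASR R1, R0, #31 */
    r0 = r0 + r1;                      /* ADD R0, R0, R1  */
    r0 = r0 ^ r1;                      /* EOR R0, R0, R1  */
    return r0;
}

/* Two-instruction version: the shift is folded into the second operand. */
static uint32_t abs_two(int32_t x)
{
    uint32_t r0 = (uint32_t)x;
    uint32_t r1 = r0 + sign_mask(x);   /* ADD R1, R0, R0, ASR #31 */
    r0 = r1 ^ sign_mask(x);            /* EOR R0, R1, R0, ASR #31 */
    return r0;
}

/* Hypothetical helper: asserts that both models agree on a few sample inputs. */
static void check_abs_models(void)
{
    static const int32_t samples[] = { 0, 1, -1, 42, -42, INT32_MAX, INT32_MIN };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(abs_three(samples[i]) == abs_two(samples[i]));
    }
}
```

Calling `check_abs_models()` from a test program exercises both models. It only shows that the two sequences compute the same value; it says nothing about their timing.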

There are a couple of places where I can see potential performance differences:

One additional arithmetic shift right (ASR) operation
One fewer memory fetch

So which one is actually faster? Does it depend on the processor or on memory access speed?

+6

performance optimization assembly arm cortex-m3

Ken w May 11, '13 at 16:41

2 answers

Here is another two-instruction version:

    cmp   r0, #0
    rsblt r0, r0, #0

Which translates to this simple code:

  if (r0 < 0) { r0 = 0-r0; }

This code should be pretty fast, even on modern ARM CPU cores like the Cortex-A8 and A9.
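A minimal C sketch of the compare-and-reverse-subtract pair, for anyone who wants to check its behaviour at the edges. The function names are hypothetical, and the negation is done in unsigned arithmetic because `-x` overflows for `INT32_MIN` in signed C, whereas RSB simply wraps modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* Models: cmp r0, #0 ; rsblt r0, r0, #0 */
static uint32_t abs_cmp_rsb(int32_t x)
{
    uint32_t r0 = (uint32_t)x;
    if (x < 0) {        /* cmp r0, #0 sets the flags; LT means "signed less than" */
        r0 = 0u - r0;   /* rsblt r0, r0, #0 : r0 = 0 - r0, wrapping modulo 2^32  */
    }
    return r0;
}

/* Hypothetical helper: spot-checks the model, including the INT32_MIN edge case. */
static void check_abs_cmp_rsb(void)
{
    assert(abs_cmp_rsb(0) == 0u);
    assert(abs_cmp_rsb(7) == 7u);
    assert(abs_cmp_rsb(-7) == 7u);
    /* The magnitude of INT32_MIN does not fit in a signed 32-bit value,
       so the result is only meaningful when read as unsigned. */
    assert(abs_cmp_rsb(INT32_MIN) == 0x80000000u);
}
```

Note that the result for `INT32_MIN` is `0x80000000`, which is the correct magnitude when read as unsigned, matching the `unsigned long` return type of the original C function.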

+5

Nils Pipenbrinck May 13, '13 at 16:55


David Thomas · Accepted Answer · 2013-05-13T23:11:30+0000

Head over to ARM.com and grab the Cortex-M3 datasheet. Section 3.3.1, on page 3-4, has the instruction timings. Fortunately, they are quite simple on the Cortex-M3.

From those timings we can see that in an ideal "zero wait state" system, your professor's example takes 3 cycles:

    ASR R1, R0, #31         ; 1 cycle
    ADD R0, R0, R1          ; 1 cycle
    EOR R0, R0, R1          ; 1 cycle
                            ; total: 3 cycles

and your version takes 2 cycles:

    ADD R1, R0, R0, ASR #31 ; 1 cycle
    EOR R0, R1, R0, ASR #31 ; 1 cycle
                            ; total: 2 cycles

So yours is, in theory, faster.

You mention "one fewer memory fetch", but is that true? How big are the respective routines? Since we are dealing with Thumb-2, we have a mix of 16-bit and 32-bit instructions available. Let's see how they assemble:

Their version (adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
    abs:
        asrs r1, r0, #31
        adds r0, r0, r1
        eors r0, r0, r1

Assembled:

    00000000  17c1      asrs  r1, r0, #31
    00000002  1840      adds  r0, r0, r1
    00000004  4048      eors  r0, r1

That is 3 x 2 = 6 bytes.

Your version (again, adjusted for UAL syntax):

    .syntax unified
    .text
    .thumb
    abs:
        add.w r1, r0, r0, asr #31
        eor.w r0, r1, r0, asr #31

Assembled:

    00000000  eb0071e0  add.w  r1, r0, r0, asr #31
    00000004  ea8170e0  eor.w  r0, r1, r0, asr #31

That is 2 x 4 = 8 bytes.

So, instead of removing a memory fetch, you have actually increased the size of the code.
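The byte counts follow from how Thumb-2 marks instruction width: a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit instruction, and anything else is a complete 16-bit instruction. Here is a rough C sketch of that rule applied to the encodings listed above. The function names are made up for illustration, and it assumes a well-formed stream that starts on an instruction boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the Thumb instruction whose first halfword is hw. */
static size_t thumb_insn_size(uint16_t hw)
{
    unsigned top5 = (unsigned)hw >> 11;
    return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 4 : 2;
}

/* Counts the instructions in a stream of n halfwords (n * 2 bytes of code). */
static size_t thumb_insn_count(const uint16_t *hw, size_t n)
{
    size_t count = 0;
    size_t i = 0;
    while (i < n) {
        i += thumb_insn_size(hw[i]) / 2;   /* skip one or two halfwords */
        count++;
    }
    return count;
}

/* Hypothetical helper: checks the two routines from the listings above. */
static void check_listing_sizes(void)
{
    static const uint16_t three_line[] = { 0x17c1, 0x1840, 0x4048 };
    static const uint16_t two_line[]   = { 0xeb00, 0x71e0, 0xea81, 0x70e0 };

    assert(thumb_insn_count(three_line, 3) == 3);   /* 3 instructions */
    assert(sizeof three_line == 6);                 /* in 6 bytes     */
    assert(thumb_insn_count(two_line, 4) == 2);     /* 2 instructions */
    assert(sizeof two_line == 8);                   /* in 8 bytes     */
}
```

The flexible second operand with a shift is only available in the 32-bit encodings here, which is why folding the ASR into the ADD and EOR costs two extra bytes overall.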

But does this affect performance? My advice: benchmark.
