Search for an MSB-set of a maximum element in an array

Given an array of len elements of type signed short , it must find the position of the most significant bit set in the element of the maximum absolute value in the array. For example, if an array L contains {-134, 123, 0, -890} , then f(L) should return floor(log2(abs(-890)))+1 .

Here is my current function:

 short MSBSetMaxMagnitude(const short *p, int len) { unsigned int t = 0; while (len > 0) { t |= abs(*p); p++; len--; } if(t) return (short)(32 - __builtin_clz(t)); else return 0; } 

However, it is a bit slow due to the abs () function requiring branching. I tried using abs () without branching instead, but it is even slower since it contains at least 3 arithmetic instructions. Therefore, I hope that perhaps there is an efficient algorithm to find exactly what I need.

+4
source share
2 answers

When you see that you are working on the ARM platform, you can use the following implementation of abs in two instructions:

 EORS r1, r1, r1, ASR #32 (x = x ^ (x >> 32); carry_flag = sign_bit) ADC r1, r1, #0 (add the sign_bit to x) 

If you can carry the +/- 1 error in the calculations, discard the second instruction; then you can express it in C:

 int abs_almost_exact(int x) { return x ^ (x >> 32); } 

But the big problem is, however, the cycle. You are likely to benefit from the deployment (since there is so little to do with each iteration):

 do { // assuming len is even! int value1 = *p++; int value2 = *p++; value1 = abs(value1); // or replace abs by the hand-made version value2 = abs(value2); t |= value1; t |= value2; len--; } while (len > 0); 

Note. I replaced while {} with do {} while , because the compiler I use (the ARM compiler) generates the best code this way.

Also note that ARM has a delay of 2 clock cycles when loading short variables from memory (to the processor I worked with). So the minimum pivot factor is 3 (but you should still pivot).

Oh, and does your processor support reading short (half-words) variables from memory? I have heard of some very old processors that cannot do this. If so, you should change the code to load two values ​​(1 word) at once and use some bit scripts to separate them.

+1
source

Three arithmetic instructions should take very little time on any modern processor. You perform two arithmetic operations and a conditional branch in loop and index control. It is possible that slowness is associated with a combination of misses in the data cache and the loop, which can be difficult to deploy the compiler due to the use of the pointer and pointer arithmetic.

It is impossible to find a value that depends on each element of the array without looking at each element of the array, so the goal should be to make sure that all this is done in the time required to scan the array.

You can check if this is a problem by replacing:

 t |= abs(*p); 

with t | = * p;

If this is not significantly faster, I suggest experimenting with a non-branching version of abs in a manual deployed loop.

0
source

Source: https://habr.com/ru/post/1447764/


All Articles