Is a modulo operation faster than memmove?

I am writing a digital filter and I need to keep the latest X values and sum them all together.

There are two possible approaches. Either I shift the entire array with memmove to make room for the next value, and keep the required array indices as hard-coded constants in my summing algorithm:

    memmove(&Fifo[0], &Fifo[1], 12 * 4); // Shift array to the left
    Result += Factor[1] * (Fifo[5] + Fifo[7]);
    Result += Factor[2] * (Fifo[4] + Fifo[8]);
    Result += Factor[3] * (Fifo[3] + Fifo[9]);
    Result += Factor[4] * (Fifo[2] + Fifo[10]);
    Result += Factor[5] * (Fifo[1] + Fifo[11]);
    Result += Factor[6] * (Fifo[0] + Fifo[12]);

Or I don't copy any memory at all, but instead increment a counter and derive each index from it with the modulo operation (i.e. a circular buffer):

    i++; // Increment the index
    Result += Factor[1] * (Fifo[(i + 5) % 13] + Fifo[(i + 7) % 13]);
    Result += Factor[2] * (Fifo[(i + 4) % 13] + Fifo[(i + 8) % 13]);
    Result += Factor[3] * (Fifo[(i + 3) % 13] + Fifo[(i + 9) % 13]);
    Result += Factor[4] * (Fifo[(i + 2) % 13] + Fifo[(i + 10) % 13]);
    Result += Factor[5] * (Fifo[(i + 1) % 13] + Fifo[(i + 11) % 13]);
    Result += Factor[6] * (Fifo[(i + 0) % 13] + Fifo[(i + 12) % 13]);

Since this is an embedded ARM processor, I was wondering which would be more efficient. Given that the CPU has to move at least one 32-bit value internally anyway in order to perform a modulo operation, could simply moving the entire array be just as fast as calculating the correct indices?

+4

5 answers

I did just that on the Cortex M0 (LPC11C14) for a FIR filter of size 15 (Savitzky-Golay for measuring line voltage).

I found that in my case copying was somewhat slower than using a circular buffer of size 16 and computing the indices with the modulo operator. Note that 16 is a power of two, which makes the modulo very cheap (the compiler can turn % 16 into a bitwise AND).

I tried several variants and measured the runtime by toggling an output port; I recommend you do the same.
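To illustrate why the power-of-two size matters, here is a minimal sketch (the buffer size, names and helper functions are mine, not the code from this answer): with an unsigned index, % 16 reduces to a single bitwise AND.

    #define FIFO_LEN  16u               /* power of two, one slot left unused */
    #define FIFO_MASK (FIFO_LEN - 1u)   /* 0x0F                               */

    static float    Fifo[FIFO_LEN];
    static unsigned i = 0;              /* free-running write index           */

    void insert(float sample)
    {
        /* On an unsigned value, (i % 16) is the same as (i & 15), so the
           compiler emits a single AND instead of a division/remainder.   */
        Fifo[i & FIFO_MASK] = sample;
        i++;
    }

    float tap(unsigned k)               /* k = 0 is the oldest stored sample  */
    {
        return Fifo[(i + k) & FIFO_MASK];
    }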

+2

If you need to know which is faster, you have to benchmark it. If you want to know why, you have to look at the generated assembly.

There is also a halfway solution that can be good enough: use a buffer larger than you need, and only memmove when the buffer is full. That way you only have to keep track of a start offset and don't have to deal with circular-buffer wraparound. The cost is that you use more memory.

So if you want to keep 5 elements and use a buffer of 10 elements, you only have to memmove once every 5 inserts (except for the first pass, when you get 10 inserts before the first memmove).
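A minimal sketch of this scheme, assuming a 5-tap window in a 10-element buffer (the names and the push helper are illustrative, not the answerer's code):

    #include <string.h>                 /* memmove */

    #define TAPS     5                  /* values the filter actually needs         */
    #define BUF_SIZE (2 * TAPS)         /* oversized buffer, shifted only when full */

    static float Buf[BUF_SIZE];
    static int   count = 0;             /* number of samples stored so far          */

    /* Store a sample and return the start offset of the current window,
       i.e. the filter reads Buf[start] .. Buf[start + TAPS - 1].
       During warm-up (count < TAPS) the window simply starts at 0.        */
    int push(float sample)
    {
        if (count == BUF_SIZE) {
            /* Buffer exhausted: copy the newest TAPS values back to the front.
               With BUF_SIZE == 2 * TAPS this happens only once per TAPS inserts. */
            memmove(&Buf[0], &Buf[BUF_SIZE - TAPS], TAPS * sizeof(Buf[0]));
            count = TAPS;
        }
        Buf[count++] = sample;
        return (count > TAPS) ? count - TAPS : 0;
    }

The summing code can then use plain hard-coded offsets relative to the returned start index, exactly as in the memmove variant from the question.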

+2

Assuming 32-bit values, a modulo on ARM can be done in two assembly instructions, but so can moving one value in memory (one instruction to load it into a register, one to store it back). So there is no definitive answer here; it will depend on the code around it.

My feeling is that you should go for the circular-buffer approach.

+1

There is a third method that requires neither memmove nor modulo, using two switch blocks. I'm too lazy to type it all out, but the idea is that you calculate the offset and use the first switch to process one "half" of the buffer, then the second switch picks up where the first one left off and processes the other half. Note that in one of the two switch blocks you have to reverse the order of the statements.
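Here is a minimal sketch of the idea with a small 4-tap ring buffer (the size, names and the plain FIR form are my assumptions, chosen only to keep the two switch blocks short):

    #define N 4                                      /* number of taps (illustrative) */

    static float Fifo[N];                            /* ring buffer of samples        */
    static const float Factor[N] = { 0.1f, 0.2f, 0.3f, 0.4f };
    static int head = 0;                             /* index of the oldest sample    */

    float filter(float sample)
    {
        float result = 0.0f;
        int k = 0;                                   /* tap index, oldest first       */
        int m = N - 1;                               /* tap index, newest first       */

        Fifo[head] = sample;                         /* overwrite the oldest sample   */
        if (++head == N)                             /* advance without modulo        */
            head = 0;

        /* First switch: the contiguous run Fifo[head] .. Fifo[N - 1]. */
        switch (head) {
        case 0: result += Factor[k++] * Fifo[0];     /* fall through */
        case 1: result += Factor[k++] * Fifo[1];     /* fall through */
        case 2: result += Factor[k++] * Fifo[2];     /* fall through */
        case 3: result += Factor[k++] * Fifo[3];
        }

        /* Second switch: the wrapped-around part Fifo[0] .. Fifo[head - 1].
           The statements are in reverse order, so entering at 'case head'
           executes exactly 'head' of them.                                  */
        switch (head) {
        case 3: result += Factor[m--] * Fifo[2];     /* fall through */
        case 2: result += Factor[m--] * Fifo[1];     /* fall through */
        case 1: result += Factor[m--] * Fifo[0];     /* fall through */
        case 0: break;                               /* nothing wrapped around */
        }

        return result;
    }

The compiler turns each switch into a jump table, so there is no modulo and no copying, at the cost of fully unrolled code.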

0

My intuition says that memmove can cause all kinds of hazards and prevent internal forwarding, since you load and store within the same region, possibly even within the same cache lines. Some processors simply give up on such optimizations and stall all memory operations, effectively serializing them (an embedded processor may be simple enough to do this anyway, but I'm talking about the general case: on x86 or even a Cortex-A15 you can pay a bigger penalty).

0

Source: https://habr.com/ru/post/1500504/

