Optimization methods

I am looking for the fastest possible optimization of the function below without resorting to assembly, because this seems to be the bottleneck of my application. Keep in mind that the function has already been declared inline.

where P = 10 and N = 240.

 void autocorrelation(int32_t *data, float *r) {
     for (int m = 0; m < P + 1; m++) {
         register float temp = 0;
         for (int n = 0; n < N - m; n++) {
             temp += (float)(data[n]) * (float)(data[n + m]);
         }
         r[m] = temp;
     }
 }

any help is appreciated.

Thanks!

EDIT:

assembly:

  temp += (float)(data[n])*(float)(data[n+m]);
 800063A8  lddsp  R8, SP[0x0]
 800063AA  add    R1, R2, R8<<0
 800063AE  ld.w   R12, R1[R7<<0]
 800063B2  mcall  0x80006f58
 800063B6  mov    R4, R12
 800063B8  ld.w   R12, R2[R7<<0]
 800063BC  mcall  0x80006f58
 800063C0  mov    R11, R12
 800063C2  mov    R12, R4
 800063C4  mcall  0x80006f5c
 800063C8  mov    R11, R12
 800063CA  mov    R12, R5
 800063CC  mcall  0x80006f60
 800063D0  mov    R5, R12
  for ( int n=0 ; n<N-m ; n++)
 800063D2  sub    R6, -1
 800063D4  sub    R7, -4
 800063D6  cp.w   R6, R3
 800063D8  brne   0x800063ae
  r[m] = temp;
 800063DA  ld.w   R10, PC[2954]
 800063DE  lddsp  R9, SP[0x0]
 800063E0  st.w   R10[R9<<0], R5
 800063E4  sub    R0, 1
 800063E6  sub    R9, -4
 800063E8  stdsp  SP[0x0], R9
  for ( int m=0 ; m < P+1 ; m++)
 800063EA  cp.w   R0, 229
 800063EE  breq   0x800063fc
 800063F0  mov    R3, R0
  for ( int n=0 ; n<N-m ; n++)
 800063F2  cp.w   R0, 0
 800063F4  brgt   0x800063a2
 800063F8  mov    R5, 0
 800063FA  rjmp   0x800063da

//////////////////////////////////////////////////////////////////////

So, I changed the code to:

 void autocorrelation(float *data, float *r) {
     for (int m = 0; m < P + 1; m++) {
         register float temp = 0;
         for (int n = 0; n < N - m; n++) {
             temp += data[n] * data[n + m];
         }
         r[m] = temp;
     }
 }

and reduced the time by about a third (one tick = 1/16000 s, one sample period at 16 kHz): initially 108 ticks, now 70 ticks.

new build:

  temp += data[n]*data[n+m];
 800063C2  add    R2, R3, R0<<0
 800063C6  ld.w   R11, R3[R7<<0]
 800063CA  ld.w   R12, R2[R7<<0]
 800063CE  mcall  0x80006f68
 800063D2  mov    R11, R12
 800063D4  mov    R12, R5
 800063D6  mcall  0x80006f6c
 800063DA  mov    R5, R12
  for ( int n=0 ; n<N-m ; n++)
 800063DC  sub    R6, -1
 800063DE  sub    R7, -4
 800063E0  cp.w   R6, R4
 800063E2  brne   0x800063c6
  r[m] = temp;
 800063E4  ld.w   R9, PC[2960]
 800063E8  st.w   R9[R0<<0], R5
 800063EC  sub    R1, 1
 800063EE  sub    R0, -4
  for ( int m=0 ; m < P+1 ; m++)
 800063F0  cp.w   R1, 229
 800063F4  breq   0x80006402
 800063F6  mov    R4, R1
  for ( int n=0 ; n<N-m ; n++)
 800063F8  cp.w   R1, 0
 800063FA  brgt   0x800063bc
 800063FE  mov    R5, 0
 80006400  rjmp   0x800063e4

//////////////////////////////////////////////////////////////////////

Final solution (changed again):

I combined the suggested solution with full loop unrolling (generated by a small program I wrote), and stayed in 64-bit integers until the very end; this cut the time from 60 ticks down to 20. Across the six functions with the same structure, I went from 250 ticks for the original code down to 50 ticks, where my ping-pong buffers required everything to finish within 160 ticks, so I now have some headroom:

 void fastAutocorrelation(int64_t *data, float *r) {
     int64_t temp;               /* accumulator (the original listing mistakenly
                                    declared this as an uninitialized pointer) */
     int64_t *datan  = data;
     int64_t *datanm = data;

     /* lag 0: fully unrolled, one line per product */
     temp  = (*datan++) * (*datanm++);
     temp += (*datan++) * (*datanm++);
     temp += (*datan++) * (*datanm++);
     /* ... the same line repeated until all products for this lag are summed ... */
     temp += (*datan++) * (*datanm++);
     *r++ = (float)temp / int64ToFloat;

     /* lag 1: reset the pointers and unroll again */
     datan  = data;
     datanm = data + 1;
     temp  = (*datan++) * (*datanm++);
     temp += (*datan++) * (*datanm++);
     ...
4 answers

I see you already shaved a little off by removing the casts. If you are into pointer tricks, here is one to try. Depending on your compiler, your mileage may vary:

 void autocorrelation(float *restrict data, float *restrict r) {
     float *data_end = data + N;
     for (int m = 0; m < P + 1; m++) {
         float temp = 0;
         for (float *data_n = data, *data_nm = data + m;
              data_nm != data_end;
              data_n++, data_nm++) {
             temp += *data_n * *data_nm;
         }
         r[m] = temp;
     }
 }

I added the restrict keyword so that the compiler knows data and r do not overlap or point to the same thing.


Please note that your processor has no floating-point hardware; floating-point operations are emulated in software. This means the bottleneck is not the loop control (which the compiler is already handling well with strength reduction) but the floating-point emulator.

Given that your processor has no built-in floating point, it probably also has no large L1 cache. Reordering the loops can improve data locality. The original code makes P + 1 = 11 passes over the 240-element array, which is poor locality. It would be better to make a single sweep over the array, examining up to 11 elements at a time.

 void autocorrelation(int32_t *data, float *r) {
     int m, n;
     for (m = 0; m < P + 1; m++)
         r[m] = 0.0f;
     for (n = 0; n < N; n++) {
         int limit = (P + 1 < N - n) ? P + 1 : N - n;  /* min(P + 1, N - n) */
         for (m = 0; m < limit; m++) {
             r[m] += (float)data[n] * (float)data[n + m];
         }
     }
 }

(Note that converting to pointers will not help, because the compiler has already optimized the indexing into pointer arithmetic, as your disassembly shows.)


If you feel like some adventurous experimentation, you could try the Intel ICC compiler (was that x86 assembly?). With the right command-line switches it will automatically parallelize your for loops, using a separate thread per chunk of iterations. However, the loops must be meaty enough for the threading overhead to pay off.

Another approach is to get down and dirty with SSE and AVX. Even better, I believe the latest Intel x86 processors have a fused multiply-add instruction, which is exactly what is needed for correlations, FFTs, etc. That way you arrange your code so that several operations are performed per cycle (beyond what the regular CPU pipeline can reach). Effectively there are intrinsic "functions" that map directly to SSE/AVX opcodes, which makes them easy to use from C. The compiler needs to know about them (Intel's certainly does); otherwise you have to write inline assembler. You also have the problem of handling different processor generations at runtime; not every PC has the latest Intel chip.

Or you can be really lazy like me and use a library of pre-optimized routines such as Intel IPP/MKL. Like ICC, these cost money, but that can be worthwhile if speed = big money in your project.


Have you tried:

1) Advancing a pointer instead of indexing the r array (similar to what @paddy does for data_n and data_nm)? E.g. *r++ = temp instead of r[m] = temp.

2) Unrolling the loop? Your compiler evidently does not do it, and for the inner loop in particular this would be a big speed boost. See http://www.quickiwiki.com/en/Duff's_device for a classic way to unroll a loop of arbitrary length.

3) Taking loop unrolling to the extreme (maybe not very practical, I know, but it is cool nonetheless)? You know the values of N and P: you can fully unroll the loops and write out all the code, long but branch-free. Depending on your architecture (a deep enough prefetch pipeline, combined with decent branch prediction), this can give a large speedup. You could even write a small utility that generates the fully unrolled loop for you and include the generated code in your .c file. Or just use macros and metaprogramming.


Source: https://habr.com/ru/post/956456/
