32x32 Multiply and add optimizations

I am working on optimizing an application and found that I need to speed up the inner loop below for better performance. rgiFilter is a 16-bit array.

for (i = 0; i < iLen; i++) {
    iPredErr = (I32)*rgiResidue;
    rgiFilter = rgiFilterBuf;
    rgiPrevVal = rgiPrevValRdBuf + iRecent;
    rgiUpdate = rgiUpdateRdBuf + iRecent;

    iPred = iScalingOffset;

    for (j = 0; j < iOrder_Div_8; j++) {
        iPred += (I32) rgiFilter[0] * rgiPrevVal[0];
        rgiFilter[0] += rgiUpdate[0];

        iPred += (I32) rgiFilter[1] * rgiPrevVal[1];
        rgiFilter[1] += rgiUpdate[1];

        iPred += (I32) rgiFilter[2] * rgiPrevVal[2];
        rgiFilter[2] += rgiUpdate[2];

        iPred += (I32) rgiFilter[3] * rgiPrevVal[3];
        rgiFilter[3] += rgiUpdate[3];

        iPred += (I32) rgiFilter[4] * rgiPrevVal[4];
        rgiFilter[4] += rgiUpdate[4];

        iPred += (I32) rgiFilter[5] * rgiPrevVal[5];
        rgiFilter[5] += rgiUpdate[5];

        iPred += (I32) rgiFilter[6] * rgiPrevVal[6];
        rgiFilter[6] += rgiUpdate[6];

        iPred += (I32) rgiFilter[7] * rgiPrevVal[7];
        rgiFilter[7] += rgiUpdate[7];

        rgiFilter += 8;
        rgiPrevVal += 8;
        rgiUpdate += 8;
    }

    /* ... more code here ... */
}

9 answers

Your best bet is to do more than one operation at a time, and that means one of these 3 options:

  • SSE instructions (SIMD). You process multiple memory locations with a single instruction.
  • Multiple cores (MIMD). You split the work across threads, one per core. This only pays off if the chunks are independent, and here each pass rewrites the filter coefficients, so the partitioning would need care. With 4 cores the theoretical ceiling is a 4x speedup.
  • Restructuring the code so that independent operations overlap, e.g. breaking the single iPred dependency chain into several accumulators so the CPU can keep multiple multiply-adds in flight.

If rgiFilterBuf, rgiPrevValRdBuf and rgiUpdateRdBuf never alias each other, declare the pointers with restrict. That tells the compiler the stores through rgiFilter cannot touch the other arrays, which enables much more aggressive optimization.

Beyond that, look at whatever vector instructions your target supports (for example SSE on x86).
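As a sketch of what the restrict suggestion could look like (the function name filter_step and the I16/I32 typedefs are assumptions, mirroring the types used in the question):

```c
#include <stddef.h>

typedef short I16;
typedef int   I32;

/* Hypothetical helper: one outer-loop step of the adaptive filter.
 * restrict promises the compiler the three buffers never overlap,
 * so the store to rgiFilter[j] cannot invalidate the other loads. */
static I32 filter_step(I16 *restrict rgiFilter,
                       const I16 *restrict rgiPrevVal,
                       const I16 *restrict rgiUpdate,
                       size_t iOrder, I32 iScalingOffset)
{
    I32 iPred = iScalingOffset;
    for (size_t j = 0; j < iOrder; j++) {
        iPred += (I32) rgiFilter[j] * rgiPrevVal[j]; /* multiply-accumulate */
        rgiFilter[j] += rgiUpdate[j];                /* coefficient update  */
    }
    return iPred;
}
```

With the aliasing guarantee in place, the compiler is free to reorder and vectorize the loads and stores.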


Honestly, there is not much more you can squeeze out of this in portable C. This loop is exactly what SIMD was made for; if you cannot write SIMD intrinsics directly, at least check whether your compiler will auto-vectorize it, because otherwise you are at the mercy of the optimizer...


SSE2

Take a look at _mm_madd_epi16, which maps almost directly onto

iPred += (I32) rgiFilter[] * rgiPrevVal[];

since it multiplies eight pairs of 16-bit values and sums adjacent 32-bit products. And at _mm_add_epi16 / _mm_add_epi32, which cover

rgiFilter[] += rgiUpdate[];

eight (or four) elements at a time.

These intrinsics are available in the Microsoft and Intel compilers, and in GCC as well. Check your compiler's documentation for the exact headers.
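Here is a hedged sketch of one unrolled-by-8 step built from those intrinsics (the helper name madd8_sse2 and the typedefs are assumptions; unaligned loads are used so the buffers need no special alignment):

```c
#include <emmintrin.h> /* SSE2 intrinsics */

typedef short I16;
typedef int   I32;

/* Sketch: process 8 taps at once with SSE2.
 * _mm_madd_epi16 yields four I32 partial sums, which are reduced
 * to a scalar at the end; _mm_add_epi16 does the coefficient update. */
static I32 madd8_sse2(I16 *rgiFilter, const I16 *rgiPrevVal,
                      const I16 *rgiUpdate, I32 iPred)
{
    __m128i vf = _mm_loadu_si128((const __m128i *) rgiFilter);
    __m128i vp = _mm_loadu_si128((const __m128i *) rgiPrevVal);
    __m128i vu = _mm_loadu_si128((const __m128i *) rgiUpdate);

    /* eight 16x16 products summed pairwise into four I32 lanes */
    __m128i acc = _mm_madd_epi16(vf, vp);

    /* rgiFilter[k] += rgiUpdate[k], eight 16-bit adds at once */
    _mm_storeu_si128((__m128i *) rgiFilter, _mm_add_epi16(vf, vu));

    /* horizontal reduction of the four 32-bit lanes */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return iPred + _mm_cvtsi128_si32(acc);
}
```

Called once per inner-loop pass (advancing the three pointers by 8 each time), this replaces the sixteen scalar operations of the unrolled body.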

EDIT: here is something you can try right away, without touching intrinsics...

The problem with the current code is that every multiply-add feeds the single accumulator iPred, forming one long dependency chain. To break it:

  • The products of the 16-bit rgiFilter[] and rgiPrevVal[] values are already widened to I32, so the accumulators must be I32 as well.
  • Turn the scalar iPred into a small array iPred[] of I32.
  • Sum the elements of iPred[] once, after the loop (a short reduction).
  • Group the multiply-adds together and the coefficient updates together, for example:

    iPred[0] += rgiFilter[0] * rgiPrevVal[0];

    iPred[1] += rgiFilter[1] * rgiPrevVal[1];

    iPred[2] += rgiFilter[2] * rgiPrevVal[2];

    iPred[3] += rgiFilter[3] * rgiPrevVal[3];

    rgiFilter[0] += rgiUpdate[0];

    rgiFilter[1] += rgiUpdate[1];

    rgiFilter[2] += rgiUpdate[2];

    rgiFilter[3] += rgiUpdate[3];

Laid out like this, a vectorizing compiler (Intel's, for example) has a much better chance of emitting SIMD code on its own.
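A self-contained sketch of that idea (the function name and typedefs are assumptions; iOrder is assumed to be a multiple of 4):

```c
typedef short I16;
typedef int   I32;

/* Sketch: accumulate into four independent I32 lanes, then reduce.
 * The four += chains do not depend on each other, so an out-of-order
 * CPU can execute the multiply-adds in parallel. */
static I32 predict_4acc(I16 *rgiFilter, const I16 *rgiPrevVal,
                        const I16 *rgiUpdate, int iOrder, I32 iScalingOffset)
{
    I32 iPred[4] = {0, 0, 0, 0};
    for (int j = 0; j < iOrder; j += 4) {
        iPred[0] += (I32) rgiFilter[j+0] * rgiPrevVal[j+0];
        iPred[1] += (I32) rgiFilter[j+1] * rgiPrevVal[j+1];
        iPred[2] += (I32) rgiFilter[j+2] * rgiPrevVal[j+2];
        iPred[3] += (I32) rgiFilter[j+3] * rgiPrevVal[j+3];
        rgiFilter[j+0] += rgiUpdate[j+0];
        rgiFilter[j+1] += rgiUpdate[j+1];
        rgiFilter[j+2] += rgiUpdate[j+2];
        rgiFilter[j+3] += rgiUpdate[j+3];
    }
    /* final reduction, outside the loop */
    return iScalingOffset + iPred[0] + iPred[1] + iPred[2] + iPred[3];
}
```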

  • Note that every term is accumulated into the same iPred (the +=), which serializes all the additions.
  • A 1-accumulator chain is the bottleneck; with 3 or 4 independent accumulators the CPU can overlap the multiply-adds, and you sum them once at the end.

Loop unrolling.

The inner loop is already unrolled by hand, but let the compiler have a go at the rest. Gcc, for example, has -funroll-loops.


Measure first, so you know where the time actually goes. Then optimize that.

SSE is the natural fit here (as others have said): one instruction processes eight of the 16-bit taps (elements 0..7), so the whole unrolled body collapses into a handful of instructions (before hand-writing SSE, check whether your compiler can auto-vectorize this loop).

Mind the widths: the 16-bit products need a 32- or 64-bit accumulator (a 16-bit one would overflow almost immediately). 32 bits should be enough here (and 64-bit lanes give half the throughput of 32-bit ones, afaik).

Profile it.

Then: look at the dependency chains.

The loop has several, and the longest one bounds how fast an out-of-order CPU can run it:

  • The streaming loads and stores on the three buffers.

  • The serial accumulation into iPred.

  • The read-modify-write of rgiFilter through rgiUpdate.

Once you know which chain dominates, break that one up first (multiple accumulators, software pipelining), and only then reach for SIMD.


How much any of this helps depends on the compiler and the target CPU, so measure every change. Some of these tricks can even hurt on modern hardware.

First, the classic micro-optimizations, which a decent compiler applies on its own, but which are worth verifying in the generated code.

Turn the counting loop around:

for (i = 0; i < iLen; i++) {

becomes

for (i = iLen-1; i >= 0; i--) {

because comparing against 0 is often cheaper than comparing against a variable (and on some targets the decrement sets the flags for free).

Second, drop the explicit indexing in favor of pointer arithmetic, so the per-element index scaling (the implicit multiply) disappears. Then walk a single pointer over rgiFilter, advancing rgiPrevVal and rgiUpdate in step with it:

for (p = rgiFilter; p < rgiFilter+8; ) {
     iPred += (I32) (*p) * *rgiPrevVal++;
     *p++ += *rgiUpdate++;

     ....

}

You can push this further. The loads and stores are usually the expensive part, so try to halve their number: since rgiFilter is a 16-bit array, two adjacent elements fit in one 32-bit load and one 32-bit store, provided the compiler is willing to combine the accesses:

for (p = rgiFilter; p < rgiFilter+8; ) {
     I16 x = *p;
     I16 y = *(p+1); // Hope that the compiler can combine these loads
     iPred += (I32) x * *rgiPrevVal++;
     iPred += (I32) y * *rgiPrevVal++;

     *p++ += *rgiUpdate++;
     *p++ += *rgiUpdate++; // Hope that the compiler can combine these stores

     ....

}

If the loop is still memory-bound after all that, look at prefetching. Gcc, for example, provides:

__builtin_prefetch (const void * addr)
__builtin_prefetch (const void * addr, int rw)
__builtin_prefetch (const void * addr, int rw, int locality)

This hints to the CPU that the data at addr is about to be needed, so it can be pulled into cache before the loop touches it; if the hint is wrong, little is lost. The value of addr is the address of the memory to prefetch. The optional rw argument is a compile-time constant: 1 means the prefetch is preparing for a write, 0 (the default) for a read. The optional locality argument, also a compile-time constant from 0 to 3, expresses expected temporal locality: 3 (the default) means the data should be kept in all levels of cache, lower values mean less reuse is expected, and 0 means the data need not be kept at all after the access.

In addition, since the __builtin_ functions are special, the normal rules for variable number of arguments do not actually apply - this is a hint to the compiler, not a function call.
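A sketch of how the hint could be worked into the question's loop (the prefetch distance of 64 elements ahead is a placeholder that would need tuning for the target; __builtin_prefetch is specific to GCC and Clang):

```c
typedef short I16;
typedef int   I32;

/* Sketch: prefetch the next block of taps while working on this one.
 * Prefetching past the end of a buffer is harmless - it is only a hint. */
static I32 madd_prefetch(I16 *rgiFilter, const I16 *rgiPrevVal,
                         const I16 *rgiUpdate, int iOrder, I32 iPred)
{
    for (int j = 0; j < iOrder; j++) {
        if ((j & 7) == 0) {
            /* read-only streams: rw = 0; filter is written back: rw = 1 */
            __builtin_prefetch(rgiPrevVal + j + 64, 0, 0);
            __builtin_prefetch(rgiUpdate  + j + 64, 0, 0);
            __builtin_prefetch(rgiFilter  + j + 64, 1, 0);
        }
        iPred += (I32) rgiFilter[j] * rgiPrevVal[j];
        rgiFilter[j] += rgiUpdate[j];
    }
    return iPred;
}
```

The locality argument of 0 reflects that each element is touched exactly once per pass; with heavy reuse a value of 3 would be the better hint.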

You should also examine any vector operations that your target supports, as well as any general or platform-specific functions, built-in functions, or pragmas that your compiler supports to perform vector operations.


Source: https://habr.com/ru/post/1756696/

