Code or Compiler: Optimizing the IIR Filter in C for iPhone 4 and Later

I profiled my nearly finished project and see that about three-quarters of the processor's time is spent in this IIR filter function (which is called hundreds of thousands of times per second on the target hardware), so with everything else working well, I'm interested in whether it can be optimized for my specific hardware and software. My targets are only iPhone 4 and newer, only iOS 4.3 and newer, only LLVM 4.x. A bit of inaccuracy is probably acceptable if there are substantial gains to be had.

    static float filter(const float a, const float *b, const float c,
                        float *d, const int e, const float x)
    {
        float return_value = 0;
        d[0] = x;                          // load the new input sample
        d[1] = c * d[0] + a * d[1];
        int j;
        for (j = 2; j <= e; j++) {         // update states, accumulate output
            return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
        }
        for (j = e + 1; j > 1; j--) {      // shift the delay line
            d[j] = d[j - 1];
        }
        return return_value;
    }

Any suggestions on speeding up its evaluation? I'm also interested in your opinion on whether it can be optimized beyond the default compiler optimizations. I'm wondering whether NEON SIMD would help here (this is a new platform for me), whether VFP can be used, or whether LLVM auto-vectorization will help.

I tried the following LLVM flags:

-ffast-math (there was no noticeable difference)

-O4 (made a big difference on the iPhone 4S, about a 25 percent reduction in time, but no noticeable difference on my minimum-target iPhone 4 device, which is the main device I need to improve)

-O3 -mllvm -unroll-allow-partial -mllvm -unroll-runtime -funsafe-math-optimizations -ffast-math -mllvm -vectorize -mllvm -bb-vectorize-aligned-only (the LLVM auto-vectorization flags from Hal Finkel's slides here: http://llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf ; these made things slower than the standard optimization of an Xcode Release build)

I'm open to other flags, different approaches, and changes to the function. I would prefer to leave the input and return types and values unchanged. There is a discussion of using NEON intrinsics for an FIR filter here: https://pixhawk.ethz.ch/_media/software/optimization/neon_support_in_the_arm_compiler.pdf , but I don't have enough experience with the subject to apply that information to my own case. Thanks for any insight.

EDIT: My apologies for not noting this before. After studying aka.nice's answer, I noticed that the values passed for e, a, and c are always the same, and I know them before run time, so approaches that bake this information in are an option.
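For example, something along these lines, where the #define values are placeholders for the real constants I know at build time:

    // Hypothetical specialization: a, c, and e baked in as compile-time
    // constants so the compiler can unroll the loops and fold the arithmetic.
    #define FILT_A 0.9f   /* placeholder for the known a */
    #define FILT_C 0.1f   /* placeholder for the known c */
    #define FILT_E 16     /* placeholder for the known e */

    static float filter_fixed(const float *b, float *d, const float x)
    {
        float return_value = 0;
        d[0] = x;
        d[1] = FILT_C * d[0] + FILT_A * d[1];
        for (int j = 2; j <= FILT_E; j++)
            return_value += (d[j] += FILT_A * (d[j + 1] - d[j - 1])) * b[j];
        for (int j = FILT_E + 1; j > 1; j--)
            d[j] = d[j - 1];
        return return_value;
    }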

+4
3 answers

Here are some transformations you could make to your code to use vDSP routines. These transformations use various temporary buffers named T0, T1, and T2. Each is a float array with enough space for e-1 elements.

First, use a temporary buffer to compute a * b[j]. This changes the original code:

    for (j = 2; j <= e; j++) {
        return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
    }

to:

    vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);    // T0[j-2] = a * b[j]
    for (j = 2; j <= e; j++)
        return_value += (d[j] += (d[j+1] - d[j-1]) * T0[j-2]);

Then use vDSP_vmul to calculate d[j+1] * T0[j-2]:

    vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
    vDSP_vmul(d+3, 1, T0, 1, T1, 1, e-1);  // T1[j-2] = d[j+1] * T0[j-2]
    for (j = 2; j <= e; j++)
        return_value += (d[j] += T1[j-2] - d[j-1] * T0[j-2]);

Then promote the vDSP_vmul to vDSP_vma (vector multiply add) to fold in the addition of d[j], computing d[j] + d[j+1] * T0[j-2]:

    vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
    vDSP_vma(d+3, 1, T0, 1, d+2, 1, T1, 1, e-1);  // T1[j-2] = d[j+1]*T0[j-2] + d[j]
    for (j = 2; j <= e; j++)
        return_value += (d[j] = T1[j-2] - d[j-1] * T0[j-2]);

I suppose I would do that much and then see whether there is any improvement. There are some issues:

  • SIMD code works best when data is 16-byte aligned. Using array indices such as j-1 and j+1 works against this. The ARM processors in phones are not as bad with misaligned data as some other processors, but performance will vary from model to model.
  • If e is large (more than a few thousand), then T0 and d may be evicted from cache during the vDSP_vma operation, and the next loop will have to reload them. There is a technique called strip mining to reduce the effect of this. I will not describe it in detail here, but, in essence, the operation is split into smaller strips of the array (see the sketch just after this list).
  • The IIR recurrence in the final loop may still throttle the processor. There are routines in vDSP for running some IIR filters (for example, vDSP_deq22), but it is unclear whether this filter can be expressed in a form that fits one of them well enough to gain more performance than the conversion would cost.
  • The summation in the final loop to compute return_value could also be hoisted out of the loop and replaced with a vDSP routine (probably vDSP_sve), but I suspect the stall caused by the IIR recurrence leaves enough slack for the additions to happen without adding significant execution time to the loop.
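For the strip-mining idea mentioned above, a rough illustration (the strip size is a placeholder, not a tuned value):

    #include <Accelerate/Accelerate.h>

    // Illustrative strip mining: run both vDSP stages over one cache-sized
    // strip at a time instead of streaming the whole array through each stage.
    #define STRIP 1024

    static void stages_strip_mined(const float a, const float *b, float *d,
                                   float *T0, float *T1, const int e)
    {
        const int n = e - 1;
        for (int start = 0; start < n; start += STRIP) {
            const int len = (n - start < STRIP) ? (n - start) : STRIP;
            vDSP_vsmul(b + 2 + start, 1, &a, T0 + start, 1, len);
            vDSP_vma(d + 3 + start, 1, T0 + start, 1, d + 2 + start, 1,
                     T1 + start, 1, len);
            // ...run the scalar recurrence over this strip here, while its
            // data is still in cache, before moving on to the next strip...
        }
    }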

The above is off the top of my head; I have not tested the code. I suggest you make the transformations one at a time so you can test the code after each change and identify any errors before continuing.
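For reference, here is roughly what the final form looks like folded back into the original function, assuming the caller preallocates the two scratch buffers; like the steps above, it is untested and should be checked against the scalar version:

    #include <Accelerate/Accelerate.h>
    #include <string.h>

    // Sketch only: T0 and T1 must each have room for at least e-1 floats.
    static float filter_vdsp(const float a, const float *b, const float c,
                             float *d, const int e, const float x,
                             float *T0, float *T1)
    {
        float return_value = 0;
        d[0] = x;
        d[1] = c * d[0] + a * d[1];
        vDSP_vsmul(b + 2, 1, &a, T0, 1, e - 1);             // T0[j-2] = a*b[j]
        vDSP_vma(d + 3, 1, T0, 1, d + 2, 1, T1, 1, e - 1);  // T1[j-2] = d[j+1]*T0[j-2] + d[j]
        for (int j = 2; j <= e; j++)
            return_value += (d[j] = T1[j - 2] - d[j - 1] * T0[j - 2]);
        memmove(d + 2, d + 1, e * sizeof *d);               // the shift loop as one copy
        return return_value;
    }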

If you can find a satisfactory filter that is not an IIR, more performance optimizations may open up.

+7

I would prefer to leave the input and return types and values unchanged...

However, moving your representation from float to integer would help a lot.

Localized to this one change in your implementation, it would not be very useful. But if you expand it to convert the whole FIR, it could pay off quickly (unless the sizes are guaranteed to always be tiny, in which case the conversion/movement time would cost more). Of course, moving larger parts of the render graph to integer would increase the gains and require even fewer conversions.
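To illustrate what the integer representation might look like (Q15 here, which assumes the signal values stay within [-1, 1); the scaling choice is mine):

    #include <stdint.h>

    // Q15 fixed point: a value in [-1, 1) stored as an int16_t scaled by 2^15.
    static inline int16_t float_to_q15(float x)   { return (int16_t)(x * 32768.0f); }
    static inline float   q15_to_float(int16_t q) { return q * (1.0f / 32768.0f); }

    // Multiply two Q15 values: widen to 32 bits, multiply, shift back down.
    static inline int16_t q15_mul(int16_t a, int16_t b)
    {
        return (int16_t)(((int32_t)a * (int32_t)b) >> 15);
    }

The gain comes when the conversions happen once at the edges of the render graph rather than inside every call.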

Another consideration is to look at the utilities in Accelerate.framework (potentially saving you from writing your own asm).

0

I had a go at rewriting the filter using the z delay operator.

For example, for e = 4, renaming the input u and the output y:

    d1*z = u
    d2*z = c*u + a*d1
    d3*z = d2 + a*(d3 - d1*z)
    d4*z = d3 + a*(d4 - d2*z)
    d5*z = d4 + a*(d5 - d3*z)
    y    = b2*d3*z + b3*d4*z + b4*d5*z

Note that the di are the filter states.
d3*z is the next value of d3 (it is apparently what the d2 variable holds in your code).
You can then eliminate the di to write the transfer function y/u in z.
You will then find that a minimal representation requires only e states, by factoring/simplifying the above transfer function.
The denominator is z*(z-a)^3, that is, one pole at 0 and another at a with multiplicity e-1.

Then you can put your filter into the standard state-space matrix representation:

    z*X = A*X + B*u
    y   = C*X + d*u

Given the special shape of the poles, you can decompose the transfer function by partial fraction expansion and get the matrices A and B in this special form (Matlab-like notation):

    A = [0 1 0 0;
         0 a 1 0;
         0 0 a 1;
         0 0 0 a]

    B = [0;
         0;
         0;
         1]

C and d are a bit less simple... They are extracted from the numerators and the direct term of the partial fraction expansion. They are polynomials in the bi, in c (degree 1), and in a (degree e).
For e = 4 we have

    C = [ a^3*b2 - a^2*b3 + a*b4 ,
          -a^2*b2 + a*b3 + (c-a^2)*b4 ,
          a*b2 + (c-a^2)*b3 + (-2*a^2*c - 2*a^2 - a + a^4)*b4 ,
          (c-a^2)*b2 + (-a^2 - a^2*c - a)*b3 + (-2*a*c + 2*a^3)*b4 ]

    d = -a*b2 - a*c*b3 + a^2*b4

If you can find a recurrence in the construction of C and d and precompute them,
then the filter can be reduced to these simple vector operations:

    z*X = a*[0; x2; x3; x4 ... xe] + [x2; x3; x4 ... xe; u]
    y   = C*[x1; x2; x3; x4 ... xe] + d*u

Or expressed as a function (Xnext, y) = filter(X, u, a, C, d, e), in pseudocode:

    y = dot_product(C, X) + d*u;   // (like BLAS _DOT)
    Xnext(1:e-1) = X(2:e);         // a memcopy (like BLAS _COPY)
    Xnext(e) = u;
    X(1) = 0;
    Xnext = a*X + Xnext;           // an in-place vector multiply-add (like BLAS _AXPY)
    X = Xnext;                     // another memcopy, outside the function (can be moved inside)

Note that if you use BLAS routines, your code will be portable to a lot of hardware, not just Apple-centric devices, and I think the performance will not differ much.
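In C, that pseudocode maps onto BLAS level-1 routines roughly like this (the names and buffer handling are my own, and it is untested):

    #include <Accelerate/Accelerate.h>  // provides cblas_* on iOS; any CBLAS works
    #include <string.h>

    // One filter step. X and Xnext are caller-provided arrays of e floats;
    // Cv is the precomputed C vector and dd the scalar d from above.
    static float filter_ss(float *X, float *Xnext, const float u,
                           const float a, const float *Cv, const float dd,
                           const int e)
    {
        float y = cblas_sdot(e, Cv, 1, X, 1) + dd * u;  // y = C*X + d*u (_DOT)
        memcpy(Xnext, X + 1, (e - 1) * sizeof *X);      // Xnext(1:e-1) = X(2:e) (_COPY)
        Xnext[e - 1] = u;                               // Xnext(e) = u
        X[0] = 0.0f;                                    // X(1) = 0
        cblas_saxpy(e, a, X, 1, Xnext, 1);              // Xnext = a*X + Xnext (_AXPY)
        memcpy(X, Xnext, e * sizeof *X);                // X = Xnext
        return y;
    }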

EDIT: About the partial fraction expansion

A pure partial fraction expansion would give a diagonal state-space form and a matrix B full of 1s. That is also an interesting option (filters in parallel).
The version used above is more like a cascade or ladder form (filters in series).

0

Source: https://habr.com/ru/post/1439865/

