I need to optimize the loop below because my application calls it thousands of times. I suppose I need to do this with NEON, but I don't know where to start.
Assumptions / Prerequisites:
- w is always 320 (a multiple of 16/32).
- pa and pb are 16-byte aligned.
- ma and mb are positive.

    int whileInstruction (const unsigned char *pa, const unsigned char *pb, int ma, int mb, int w)
    {
        int sum = 0;

        do {
            sum += ((*pa++) - ma) * ((*pb++) - mb);
        } while (--w);

        return sum;
    }

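One possible preparatory step, as a rough sketch rather than a definitive answer: expanding the product gives sum((a - ma)(b - mb)) = sum(a*b) - mb*sum(a) - ma*sum(b) + w*ma*mb, so the loop body needs only unsigned multiplies and adds, which tend to map onto widening NEON instructions such as `vmull.u8`. The name `dot_centered_expanded` is hypothetical, and the sketch assumes ma and mb are at most 255 so that nothing overflows for w = 320.

```c
#include <stdint.h>

/* Hypothetical helper: same arguments and result as whileInstruction,
 * but the subtraction of ma and mb is moved out of the loop.
 * Assumes 0 <= ma, mb <= 255 and w small enough (e.g. 320) that the
 * 32-bit sums cannot overflow. */
int dot_centered_expanded(const unsigned char *pa, const unsigned char *pb,
                          int ma, int mb, int w)
{
    uint32_t sum_ab = 0; /* sum of a[i] * b[i] */
    uint32_t sum_a  = 0; /* sum of a[i]        */
    uint32_t sum_b  = 0; /* sum of b[i]        */

    for (int i = 0; i < w; ++i) {
        sum_ab += (uint32_t)pa[i] * pb[i];
        sum_a  += pa[i];
        sum_b  += pb[i];
    }

    /* sum((a - ma) * (b - mb)) expanded term by term */
    return (int)sum_ab - mb * (int)sum_a - ma * (int)sum_b + w * ma * mb;
}
```

Whether this form is actually faster than subtracting inside the loop depends on the target core, so it is worth measuring both.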
This attempt to vectorize it does not work properly and is unsafe (clobbers are missing), but it demonstrates what I'm trying to do:

    int whileInstruction (const unsigned char *pa, const unsigned char *pb, int ma, int mb, int w)
    {
        asm volatile("lsr          %2, %2, #3      \n"
                     ".loop:                       \n"
                     "# load 8 elements:           \n"
                     "vld4.8      {d0-d3}, [%1]!   \n"
                     "vld4.8      {d4-d7}, [%2]!   \n"
                     "# do the operation:          \n"
                     "vaddl.u8    q7, d0, r7       \n"
                     "vaddl.u8    q8, d1, d8       \n"
                     "vmlal.u8    q7, q7, q8       \n"
                     "# Sum the vector and save in sum (this is wrong):\n"
                     "vaddl.u8    q7, d0, r7       \n"
                     "subs        %2, %2, #1       \n" // Decrement iteration count
                     "bne         .loop            \n" // Repeat until iteration count is zero
                     :
                     : "r"(pa), "r"(pb), "r"(w), "r"(ma), "r"(mb), "r"(sum)
                     : "r4", "r5", "r6", "r7", "r8", "r9"
                     );

        return sum;
    }

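For comparison, here is a minimal sketch of the same computation written with NEON intrinsics from `<arm_neon.h>` instead of inline assembly, which leaves register allocation and clobbers to the compiler. The function name `dot_centered_neon` is hypothetical. The sketch assumes w is a multiple of 8 and that ma and mb are at most 255 (so they fit in 16-bit lanes); the scalar branch is only there so that the file also compiles on targets without NEON.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical NEON version of whileInstruction.
 * Assumes w is a multiple of 8 and 0 <= ma, mb <= 255. */
int dot_centered_neon(const unsigned char *pa, const unsigned char *pb,
                      int ma, int mb, int w)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int16x8_t vma = vdupq_n_s16((int16_t)ma); /* ma in every lane */
    const int16x8_t vmb = vdupq_n_s16((int16_t)mb); /* mb in every lane */
    int32x4_t acc = vdupq_n_s32(0);                 /* four partial sums */

    for (int i = 0; i < w; i += 8) {
        /* Load 8 bytes, widen to 16 bits, then subtract the offset. */
        int16x8_t a = vsubq_s16(
            vreinterpretq_s16_u16(vmovl_u8(vld1_u8(pa + i))), vma);
        int16x8_t b = vsubq_s16(
            vreinterpretq_s16_u16(vmovl_u8(vld1_u8(pb + i))), vmb);

        /* Widening multiply-accumulate: 16x16 -> 32 bits per lane. */
        acc = vmlal_s16(acc, vget_low_s16(a),  vget_low_s16(b));
        acc = vmlal_s16(acc, vget_high_s16(a), vget_high_s16(b));
    }

    /* Horizontal add of the four 32-bit lanes. */
    int64x2_t pair = vpaddlq_s32(acc);
    return (int)(vgetq_lane_s64(pair, 0) + vgetq_lane_s64(pair, 1));
#else
    /* Scalar fallback for non-NEON targets (same result). */
    int sum = 0;
    for (int i = 0; i < w; ++i)
        sum += (pa[i] - ma) * (pb[i] - mb);
    return sum;
#endif
}
```

With w = 320 each 32-bit lane accumulates 80 products of at most 255 * 255, so the accumulator stays well within range. Processing 16 bytes per iteration with `vld1q_u8` and two accumulators may help further, but that is something to confirm by measurement.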