I need to write optimized NEON code for a project, and I am very happy to write assembly language, but for portability / maintainability I use NEON instrinsics. This code should be as fast as possible, so I use my experience in optimizing ARM to correctly alternate instructions and avoid pipe kiosks. No matter what I do, GCC works against me and creates slower code full of stalls.
Does anyone know how to get GCC to go astray and just translate my insides into code?
Here is an example: I have a simple loop that cancels and copies floating point values. It works with 4 sets of 4 at a time to give some time for memory to load and instructions to execute. There were many registries left, so he had no reason to distort things so much.
float32x4_t f32_0, f32_1, f32_2, f32_3; int x; for (x=0; x<n-15; x+=16) { f32_0 = vld1q_f32(&s[x]); f32_1 = vld1q_f32(&s[x+4]); f32_2 = vld1q_f32(&s[x+8]); f32_3 = vld1q_f32(&s[x+12]); __builtin_prefetch(&s[x+64]); f32_0 = vnegq_f32(f32_0); f32_1 = vnegq_f32(f32_1); f32_2 = vnegq_f32(f32_2); f32_3 = vnegq_f32(f32_3); vst1q_f32(&d[x], f32_0); vst1q_f32(&d[x+4], f32_1); vst1q_f32(&d[x+8], f32_2); vst1q_f32(&d[x+12], f32_3); }
This is the code that it generates:
vld1.32 {d18-d19}, [r5] vneg.f32 q9,q9 <-- GCC intentionally causes stalls add r7,r7,#16 vld1.32 {d22-d23}, [r8] add r5,r1,r4 vneg.f32 q11,q11 <-- all of my interleaving is undone (why?!!?) add r8,r3,#256 vld1.32 {d20-d21}, [r10] add r4,r1,r3 vneg.f32 q10,q10 add lr,r1,lr vld1.32 {d16-d17}, [r9] add ip,r1,ip vneg.f32 q8,q8
Additional Information:
- GCC 4.9.2 for Raspbian
- compiler flags:
-c -fPIE -march=armv7-a -Wall -O3 -mfloat-abi=hard -mfpu=neon
When I write a loop in ASM code matched as my built-in ones (without even using additional src / dest registers to get some free ARM loops), it is still faster than GCC code.
Update: I appreciate James's answer, but in the scheme of things, it really doesn't help with the problem. The simplest of my features has been slightly improved with the cortex-a7 option, but most have not changed. The sad truth is that the GCC optimization of the built-in functions is small. When I worked with the Microsoft ARM compiler several years ago, it consistently generated well-crafted output for embedded NEONs, while GCC stumbled sequentially. With GCC 4.9.x nothing has changed. I certainly appreciate the FOSS nature of GCC and the great work of GNU, but I do not deny that it does not do such work as Intel, Microsoft, or even ARM compilers.