How to stop GCC from violating my NEON properties?

I need to write optimized NEON code for a project, and I am very happy to write assembly language, but for portability / maintainability I use NEON instrinsics. This code should be as fast as possible, so I use my experience in optimizing ARM to correctly alternate instructions and avoid pipe kiosks. No matter what I do, GCC works against me and creates slower code full of stalls.

Does anyone know how to get GCC to go astray and just translate my insides into code?

Here is an example: I have a simple loop that cancels and copies floating point values. It works with 4 sets of 4 at a time to give some time for memory to load and instructions to execute. There were many registries left, so he had no reason to distort things so much.

float32x4_t f32_0, f32_1, f32_2, f32_3; int x; for (x=0; x<n-15; x+=16) { f32_0 = vld1q_f32(&s[x]); f32_1 = vld1q_f32(&s[x+4]); f32_2 = vld1q_f32(&s[x+8]); f32_3 = vld1q_f32(&s[x+12]); __builtin_prefetch(&s[x+64]); f32_0 = vnegq_f32(f32_0); f32_1 = vnegq_f32(f32_1); f32_2 = vnegq_f32(f32_2); f32_3 = vnegq_f32(f32_3); vst1q_f32(&d[x], f32_0); vst1q_f32(&d[x+4], f32_1); vst1q_f32(&d[x+8], f32_2); vst1q_f32(&d[x+12], f32_3); } 

This is the code that it generates:

 vld1.32 {d18-d19}, [r5] vneg.f32 q9,q9 <-- GCC intentionally causes stalls add r7,r7,#16 vld1.32 {d22-d23}, [r8] add r5,r1,r4 vneg.f32 q11,q11 <-- all of my interleaving is undone (why?!!?) add r8,r3,#256 vld1.32 {d20-d21}, [r10] add r4,r1,r3 vneg.f32 q10,q10 add lr,r1,lr vld1.32 {d16-d17}, [r9] add ip,r1,ip vneg.f32 q8,q8 

Additional Information:

  • GCC 4.9.2 for Raspbian
  • compiler flags: -c -fPIE -march=armv7-a -Wall -O3 -mfloat-abi=hard -mfpu=neon

When I write a loop in ASM code matched as my built-in ones (without even using additional src / dest registers to get some free ARM loops), it is still faster than GCC code.

Update: I appreciate James's answer, but in the scheme of things, it really doesn't help with the problem. The simplest of my features has been slightly improved with the cortex-a7 option, but most have not changed. The sad truth is that the GCC optimization of the built-in functions is small. When I worked with the Microsoft ARM compiler several years ago, it consistently generated well-crafted output for embedded NEONs, while GCC stumbled sequentially. With GCC 4.9.x nothing has changed. I certainly appreciate the FOSS nature of GCC and the great work of GNU, but I do not deny that it does not do such work as Intel, Microsoft, or even ARM compilers.

+5
source share
1 answer

In general, the optimization class that you see here is known as "instruction scheduling." GCC uses a team schedule to try to build a better schedule for instructions in each base unit of your program. Here, the β€œschedule” refers to any correct order of instructions in the block, and the β€œbest” schedule can be one that avoids kiosks and other pipeline threats, or one that reduces the range of variables in real time (which leads to a better allocation of registers) or any other purpose of ordering according to instructions.

To avoid barriers due to dangers, GCC uses the processor pipeline model that you are aiming for (see here for details on the specification language used for them, and here for an example pipeline model). This model provides some guidance on GCC scheduling algorithms for processor function blocks, and instruction execution characteristics for these function blocks. GCC can then plan instructions to minimize structural hazards due to multiple instructions requiring the same processor resources.

Without the -mcpu or -mtune (for the compiler) or the --with-cpu or --with-tune parameter (for the compiler configuration), GCC for ARM or AArch64 will try to use the representative model to revise the architecture you are targeting. In this case, -march=armv7-a makes the compiler try to schedule the instructions as if -mtune=cortex-a8 were passed on the command line.

What you see in your output is GCC’s attempt to convert your entry into a schedule that it expects to perform well on Cortex-A8, and work well on processors that implement the ARMv7-A architecture.

To improve this, you can try:

  • Explicit configuration of the processor you are targeting ( -mcpu=cortex-a7 )
  • Disabling command scheduling completely (`-fno-schedule-insns -fno-schedule-insns2)

Note that disabling command scheduling completely can lead to problems elsewhere, since GCC will no longer try to reduce the risks associated with the pipeline in your code.

Change As for your editing, performance errors in GCC can be reported to GCC Bugzilla (see https://gcc.gnu.org/bugs/ ) in the same way that there can be correctness errors. Naturally, with all optimizations, some degree of heuristics is involved, and the compiler may not be able to defeat an experienced build programmer, but if the compiler does something particularly egregious, it is worth highlighting.

+8
source

Source: https://habr.com/ru/post/1241028/


All Articles