First, which library are you using?
You are right that each core has its own NEON unit, but it is Qualcomm's own VeNum unit, and there is very little public information about it. It was developed for the Cortex-A8-class Scorpion core in the 8x50 and was considerably better than ARM's own NEON SIMD implementation. The good news is that they (qcom) design their hardware to be compatible with the basic reference design, so most code written for the Cortex-A8 works fine on Scorpion, although you may lose some performance because the instruction timings can differ.
If you compile your program with softfp, you will pay about 20 cycles of overhead for each call to a function that takes floating-point arguments and uses the NEON unit. Transferring register data from the ARM core to the NEON unit and vice versa is quite slow, and it can sometimes stall the core for many cycles while it waits for the pipeline to flush.
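One way to reduce that per-call cost is to move the loop inside the function, so that floating-point values cross the call boundary once per batch instead of once per element. The sketch below is only an illustration: `scale_one` and `scale_many` are hypothetical names, and the actual saving depends on your compiler and on whether it inlines the call. With GCC the ABI is chosen with `-mfloat-abi=softfp` or `-mfloat-abi=hard` (together with `-mfpu=neon`).

```c
#include <stddef.h>

/* Per-element version: under softfp, both float arguments and the float
 * return value are passed through core (integer) registers on every call,
 * so each call pays the core <-> NEON/VFP transfer cost. */
float scale_one(float x, float gain)
{
    return x * gain;
}

/* Batched version: pointers and the count stay in core registers, and the
 * single float argument (gain) is transferred once per batch. The data
 * itself is loaded from memory directly inside the loop. */
void scale_many(float *dst, const float *src, size_t n, float gain)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * gain;
    }
}
```

Calling `scale_many(dst, src, n, gain)` once replaces `n` calls to `scale_one`, which is where the transfer overhead would otherwise add up.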
In addition, for a threaded program that uses the floating-point unit, the kernel must save the FP registers on every context switch. That imposes an extra penalty on threads, since, as we already know, moving registers from NEON to ARM is slow and stalls the pipelines.
Many other factors can also lead to this kind of slowdown, for example poor optimization by the compiler, cache misses, not taking advantage of Scorpion's dual-issue capability, poor instruction scheduling, and threads being switched repeatedly from one core to another.
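As a rough illustration of the dual-issue and scheduling point: a loop whose every iteration depends on the previous result gives the core nothing independent to issue alongside it. Splitting the work across several independent accumulators is a common way to expose that parallelism. The function below is a hypothetical sketch (`dot4` is not from any library), and whether it actually helps on Scorpion is something you would have to measure.

```c
#include <stddef.h>

/* Dot product with four independent accumulators. The four multiply-adds
 * in the loop body do not depend on each other, so the compiler and the
 * core are free to overlap them. Note that this changes the order of the
 * floating-point additions compared with a single-accumulator loop. */
float dot4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Tail: remaining 0 to 3 elements. */
    for (; i < n; ++i) {
        s0 += a[i] * b[i];
    }

    return (s0 + s1) + (s2 + s3);
}
```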