Problems with Qualcomm Scorpion Dual Core ARM NEON?

I am developing my own library for Android, where I use ARM assembly optimization and multithreading to get maximum performance on the dual-core ARM MSM8660 chipset. While making some measurements, I noticed the following:

  • a single-threaded library with NEON optimization is faster than a single-threaded library with ARMv6 (as expected).
  • a multi-threaded library with ARMv6 optimization is faster than a single-threaded library with ARMv6 (as expected).
  • a multi-threaded library with NEON optimization is slower than a single-threaded library with NEON (definitely not expected!).

I searched all over the web for an explanation but found nothing. It behaves as if both cores shared a single NEON pipeline, yet every block diagram I have seen indicates that each core should have its own NEON unit. Does anyone know why this is happening?

+6
3 answers

First, which library are you using?

You are right that each core has its own NEON unit, but on Scorpion it is Qualcomm's own VeNum unit, and there is very little public information about it. It was developed for the Cortex-A8-class Scorpion in the 8x50 and was considerably better than ARM's own NEON SIMD implementation. The good news is that Qualcomm designs its hardware to stay compatible with the baseline ARM reference design, so most code written for the Cortex-A8 works fine on Scorpion, although you may lose some performance because the instruction timings can differ.

If you compile your program with softfp, you pay an overhead of roughly 20 cycles for every call to a function that takes floating-point arguments and uses the NEON unit, because the argument values have to be transferred from the ARM core registers to the NEON registers and back again. That transfer is quite slow and can sometimes stall the core for many cycles while it waits for the pipeline to be flushed.
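For illustration, here is a minimal sketch of what that looks like from C. The intrinsics are the standard ones from <arm_neon.h>; the function name is made up, and the flags in the comment are the usual GCC/Clang options (check what your NDK toolchain actually accepts):

    /* softfp:  gcc -O2 -mfpu=neon -mfloat-abi=softfp ...
     *          'factor' arrives in an ARM core register and must be moved
     *          into a NEON register before vmulq_n_f32 can use it, and the
     *          result has to be moved back for the return value.
     * hard:    gcc -O2 -mfpu=neon -mfloat-abi=hard ...
     *          float arguments are passed directly in VFP/NEON registers,
     *          so the ARM-to-NEON transfer (and its stall) disappears.
     */
    #include <arm_neon.h>

    float scale_and_sum4(const float *src, float factor)
    {
        float32x4_t v = vld1q_f32(src);        /* load 4 floats            */
        v = vmulq_n_f32(v, factor);            /* multiply by the scalar   */
        float32x2_t s = vadd_f32(vget_low_f32(v), vget_high_f32(v));
        s = vpadd_f32(s, s);                   /* horizontal add           */
        return vget_lane_f32(s, 0);            /* moved back out on softfp */
    }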

In addition, in a multi-threaded program that uses the floating-point unit, the kernel has to save and restore the FP/NEON registers on every context switch, which imposes an extra penalty on the threads; as noted above, moving registers between the NEON unit and the ARM core is slow and stalls the pipeline.

Many other factors can also produce a regression like this, for example poor optimization by the compiler, cache misses, not taking advantage of Scorpion's dual-issue capability, poor instruction scheduling, and threads being migrated back and forth between the two cores.
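On the last point, one possible mitigation is to pin each worker thread to a single core so the scheduler does not keep migrating it, together with its saved VFP/NEON register state, between the two Scorpion cores. This is only a sketch: it uses the raw Linux syscall because older Android Bionic versions do not provide pthread_setaffinity_np, and whether pinning actually helps depends on your workload and the scheduler:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Pin the calling thread to the given core; returns 0 on success. */
    static int pin_current_thread_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        /* gettid() is not exposed by older Bionic headers, so use the
         * syscall directly. */
        pid_t tid = (pid_t)syscall(__NR_gettid);
        return sched_setaffinity(tid, sizeof(set), &set);
    }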

+1

Probably due to cache misses. It is hard to say without more information.

0

My guess is that this is due to the extra cycles associated with flushing the NEON pipeline. The NEON pipeline sits behind the rest of the core, so you see an additional penalty for mispredicted branches and the like.

If your threads need to synchronize fairly often, or if you use a lot of locks, I think you will see large penalties with NEON.

The only way you are going to get an overall performance gain from NEON in multi-threaded code is if the code is embarrassingly parallel, with very little and infrequent communication between the threads.
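To make that concrete, here is a minimal sketch of the embarrassingly parallel pattern: each thread works on its own half of the buffer with no shared writes, and the only synchronization point is the final join. The names are illustrative, process_chunk_neon stands in for whatever NEON-optimized kernel the library actually provides, and error handling is omitted for brevity:

    #include <pthread.h>
    #include <stddef.h>

    /* Placeholder for your NEON-optimized routine. */
    extern void process_chunk_neon(float *data, size_t count);

    struct chunk { float *data; size_t count; };

    static void *worker(void *arg)
    {
        struct chunk *c = (struct chunk *)arg;
        process_chunk_neon(c->data, c->count);   /* no locks, no shared writes */
        return NULL;
    }

    void process_parallel(float *data, size_t count)
    {
        size_t half = count / 2;
        struct chunk a = { data,        half         };
        struct chunk b = { data + half, count - half };

        pthread_t t;
        pthread_create(&t, NULL, worker, &b);    /* second core runs 'b' */
        worker(&a);                              /* this thread runs 'a' */
        pthread_join(t, NULL);                   /* the only sync point  */
    }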

0

Source: https://habr.com/ru/post/898280/

