Why is LLVM's ExecutionEngine faster than compiled code?

I have a compiler that targets LLVM, and I provide two ways to run the code:

  • Run it directly. This mode compiles the code to LLVM IR and uses the JIT ExecutionEngine to turn it into machine code on the fly and run it, without producing an output file.
  • Compile it and run it separately. In this mode, the compiler emits an LLVM .bc file, which I manually optimize (using opt), compile to native assembly (with llc), assemble and link (using gcc), and then run (see the sketch after this list).
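
For concreteness, here is what the second pipeline looks like as shell commands. The file names are hypothetical; the flags are the ones discussed below:

    opt -std-compile-opts prog.bc -o prog.opt.bc    # run IR-level optimization passes
    llc -O3 prog.opt.bc -o prog.s                   # generate native assembly
    gcc prog.s -o prog                              # assemble and link
    ./prog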

I expected approach #2 to be faster than approach #1, or at least the same speed, but after running a few speed tests, I was surprised to find that #2 consistently runs about twice as slow. That is a huge speed difference.

In both cases the same LLVM source code is run. With approach #1, I haven't bothered to run any LLVM optimization passes (which is why I expected it to be slower). With approach #2, I run opt with -std-compile-opts and llc with -O3, to maximize optimization, yet it still doesn't come close to #1. Here is an example run of the same program:

  • #1 without optimization: 11.833s
  • #2 without optimization: 22.262s
  • #2 with optimization (-std-compile-opts and -O3): 18.823s

Does the ExecutionEngine do something special that I don't know about? Is there any way for me to optimize the statically compiled code so that it achieves the same performance as the ExecutionEngine JIT?

+43
llvm
May 13 '11 at 7:06 AM
2 answers

It is normal for a VM with a JIT to run some applications faster than a statically compiled application. That's because a VM with a JIT is like a simulator that simulates a virtual computer, and also runs a compiler in real time. Because both tasks are built into the VM with a JIT, the machine simulator can feed information to the compiler, so that the code can be recompiled to run more efficiently. The information it provides is not available to statically compiled code.

This effect has also been noted with Java VMs and with Python's PyPy VM, among others.
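
The closest static analogue of that runtime feedback is profile-guided optimization. It doesn't slot directly into an opt/llc pipeline (the profile has to be collected by the compiler that sees the source), but as a sketch of the feedback idea using gcc on a plain C source, with hypothetical file names and input:

    gcc -O2 -fprofile-generate prog.c -o prog    # build an instrumented binary
    ./prog < representative-input                # run it; profile data (.gcda) is written
    gcc -O2 -fprofile-use prog.c -o prog         # recompile using the collected profile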

+29

Another issue is code alignment and other optimizations. CPUs today are so complex that it's hard to predict which techniques will result in faster execution of the final binary.
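
Alignment, at least, is something you can experiment with directly: gcc exposes flags for it. A small sketch follows; the values are only illustrative, and the payoff is target-specific:

    # pad functions, loop headers and branch targets to 32-byte boundaries
    gcc -O2 -falign-functions=32 -falign-loops=32 -falign-jumps=32 prog.c -o prog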

As a real-life example, consider Google Native Client - I mean the original nacl compilation approach, not the LLVM-based one (because, as far as I know, there is currently a direction toward supporting both "nativeclient" and (modified) "LLVM bitcode" code).

As you can see in the presentations (look on youtube.com) or in papers, like this one, Native Client: A Sandbox for Portable, Untrusted x86 Native Code, even though their alignment technique makes the code size larger, in some cases such alignment of instructions (for example, padding with noops) yields better cache hit rates.

Aligning instructions with noops and reordering instructions are techniques known from parallel computing, and here, too, they show an impact.

I hope this answer gives an idea of how many circumstances can affect code execution speed; there are many possible causes for different pieces of code, and each of them needs investigation. Nevertheless, it is an interesting topic, so if you find some more details, don't hesitate to re-edit your answer and tell us in a "post scriptum" what you found :). (Maybe a link to a whitepaper/devblog with new findings :)). Benchmarks are always welcome - take a look: http://llvm.org/OpenProjects.html#benchmark .

+14


