Is it a good idea to compile a language in C?

From all over, I get the feeling that compiling to C is no longer considered such a good idea. GHC's C backend is no longer actively developed (this is my unsupported impression). Compilers now target C-- or LLVM instead.

Generally, I would have thought that GCC is a good, mature compiler that optimizes code well, so compiling to C would leverage GCC's maturity to get better and faster code. Is this not true?

I understand that the answer largely depends on the nature of the compiled language and on other factors, such as tooling convenience. I am looking for a fairly general answer (with respect to the compiled language) that focuses solely on performance (disregarding code quality, etc.). I would also be very happy if the answer explained why GHC is drifting away from C and why LLVM works better as a backend (see this), or gave examples of other compilers doing the same that I don't know about.

+42
c gcc compiler-construction ghc
Jan 23 '12 at 18:08
10 answers

Although I am not a compiler specialist, I believe it comes down to the fact that you lose something in translating to C that you would not lose in translating to, for example, LLVM.

If you think about the compilation process via C, you create a compiler that translates to C code, then the C compiler translates that into an intermediate representation (an in-memory AST), and then translates it into machine code. The authors of the C compiler have probably spent a lot of time optimizing the patterns humans typically write in the language, but you are unlikely to be able to make your source-to-C compiler emit code that mimics what a human would write. There is a loss of fidelity in going to C: the C compiler knows nothing about your original code's structure. To get those optimizations, you essentially tune your compiler to try to generate the kind of C code the C compiler knows how to optimize when it builds its AST. Messy.

If, however, you translate directly into LLVM's intermediate language, it is like compiling your code to a high-level, machine-independent bytecode; it is akin to the C compiler giving you access to specify exactly what its AST should contain. Essentially, you cut out the middleman that parses the C code and go straight to the high-level representation, which preserves more of your code's characteristics while requiring less translation.

Also related to performance: LLVM can do some really interesting things for dynamic languages, such as generating binary code at runtime. This is the "cool" part of just-in-time compilation: emitting machine code as the program runs, rather than being stuck with what was produced at compile time.

+23
Jan 23 '12 at 18:16

Let me list my two biggest problems with compiling to C. Whether they are a problem for your language depends on what features you have.

  • Garbage collection . When you have garbage collection, you may need to interrupt regular execution at almost any point in the program, and at that point you need access to all the pointers that point into the heap. If you compile to C, you have no idea where those pointers are. C is responsible for the local variables, arguments, etc. The pointers are probably on the stack (or perhaps in register windows on SPARC), but there is no real access to the stack. And even if you scan the stack, which values are pointers? LLVM actually addresses this problem (though I don't know how well, since I have never used LLVM with a GC).

  • Tail calls . Many languages assume that tail calls work (i.e. that they do not grow the stack); Scheme mandates it, Haskell assumes it. This is not the case with C. Under certain circumstances you can convince some C compilers to perform tail calls, but you want tail calls to be reliable, for example when tail-calling an unknown function. There are clumsy workarounds, such as trampolining, but nothing entirely satisfactory.

+27
Jan 24 '12

Part of the reason GHC moved away from the old C backend was that the code GHC generated was not code gcc could optimize particularly well. So as GHC's native code generator (NCG) improved, there was less return for a lot of work. As of 6.12, the NCG's code was slower than the C-compiled code in only a few cases, and once the NCG got even better in ghc-7 there was no longer enough incentive to keep the gcc backend alive. LLVM is a better target because it is more modular, and many optimizations can be performed on its intermediate representation before passing the result on.

On the other hand, last time I looked, JHC still produced C, and the final binary was built by (usually? exclusively?) gcc. And JHC's binaries tend to be pretty fast.

So if you can generate code that the C compiler does a great job with, it is still a good option; but you probably shouldn't jump through too many hoops to produce good C if you can more easily produce good executables by another route.

+8
Jan 23 '12 at 18:24

As you mentioned, whether C is a good target language depends very much on your source language. Here are a few reasons why C has drawbacks compared to LLVM or a custom target language:

  • Garbage collection: a language that wants to support efficient garbage collection needs to track extra information that C gets in the way of. When an allocation fails and a collection is triggered, the GC needs to find out which values on the stack and in registers are pointers and which are not. Since the register allocator is not under our control, we have to use more expensive techniques, such as writing all pointers to a separate shadow stack. This is just one of many problems with trying to support a modern GC on top of C. (Note that LLVM still has some issues in this area too, but I hear it is being worked on.)

  • Language features and optimizations: some languages rely on certain optimizations; for example, Scheme relies on tail call optimization. Modern C compilers can do this but do not guarantee it, which can break programs whose correctness depends on it. Another feature that is difficult to support on top of C is coroutines.

    Most dynamically typed languages also cannot be optimized well by C compilers. For example, Cython compiles Python to C, but the generated C uses lots of calls to generic functions that are unlikely to be optimized well even by the latest versions of GCC. Just-in-time compilation à la PyPy/LuaJIT/TraceMonkey/V8 is much better suited to delivering good performance for dynamic languages (at the cost of much higher implementation effort).

  • Development experience: having an interpreter or a JIT can also give developers a much more convenient experience; generating C code, then compiling and linking it, is certainly slower and less convenient.

However, I still consider it reasonable to use C as a compilation target for prototyping new languages. Given that LLVM was explicitly designed as a compiler backend, I would only consider C if there are good reasons not to use LLVM. If the source language is very high-level, you will most likely need an earlier, higher-level intermediate representation anyway, since LLVM really is very low-level (for example, GHC performs most of its interesting optimizations before generating LLVM code). Oh, and if you are prototyping a language, using an interpreter is probably easiest; just try to avoid features that depend too heavily on the interpreter's implementation.

+8
Jan 23 '12

In addition to all the code-generator quality arguments, there are other problems:

  1. The free C compilers (gcc, clang) are somewhat Unix-oriented.
  2. Supporting more than one compiler (for example, gcc on Unix and MSVC on Windows) requires duplication of effort.
  3. Compilers can drag in runtime libraries (or even *nix emulation layers) on Windows, which are painful. Having two different C runtimes to build on (e.g. Linux libc and msvcrt) complicates your own runtime and its maintenance.
  4. Your project gains a large, externally versioned component, which means that major version transitions (e.g. a name-mangling change could break your runtime library; ABI changes such as altered alignment) may require real work. Note that this applies to both the compiler and the externally versioned part(s) of the runtime library, and multiple compilers multiply the problem. It is not as bad with C as a backend as it is when you bind directly to (read: bet the farm on) a backend, as you would as a gcc/llvm frontend.
  5. In many languages that follow this path, you see C-isms creep into the core language. Of course you don't have to let them, but you will be tempted :-)
  6. Language features that do not map directly onto standard C (for example, nested procedures and other things that assume control of the stack) are hard to implement.
  7. If anything goes wrong, users will be confronted with C-level compiler or linker errors that are outside their frame of reference. Parsing those and turning them into your own error messages is painful, especially across multiple compilers and versions.

Please note that point 4 also means you will have to spend time just keeping things working as the external projects evolve. That time is usually not budgeted for in your project, and since those projects move at their own pace, multi-platform releases will need a lot of extra release engineering to keep up with the changes.

So, in short, from what I have seen, this route lets you get started quickly (a reasonable code generator for free on many architectures), but there are downsides too. Most of them come down to loss of control and the poor Windows support of *nix-oriented projects like gcc. (LLVM is too new to judge long-term, but their rhetoric sounds a lot like gcc's ten years ago.) If a project you depend on heavily takes a certain course (e.g. gcc being very slow to support win64), you are stuck with it.

First, decide whether you want serious non-*nix support (OS X is fairly unixy), or just a Linux compiler with a stopgap mingw port for Windows. Many compilers need first-class Windows support.

Second, how finished must the product be, and who is the main audience? Is it an open-source developer tool whose users can handle a DIY toolchain, or do you want to target the entry-level market (like many third-party products, e.g. RealBasic)?

Or do you really want to deliver a polished product for professionals, with deep integration and a complete toolchain?

All three are valid directions for a compiler project. Ask yourself what your main focus is, and don't assume that the extras will materialize in time. For example, look at where the projects that chose to be GCC frontends in the early nineties ended up.

Essentially, the Unix way is to go broad (maximize platforms).

Complete kits (such as VS and Delphi, the latter of which recently also started supporting OS X and in the past supported Linux) go deep and try to maximize productivity (specifically, supporting the Windows platform with deep integration).

Third-party projects are less clear-cut; they aim more at independent programmers and niche shops. They have fewer developer resources but manage them better.

+7
Jan 24 '12

One point that has not yet been raised: how close is your language to C? If you are compiling a fairly low-level imperative language, C's semantics may map very closely onto the language you are implementing. In that case, targeting C is probably a win, because code written in your language will likely resemble the code someone would write in C by hand. That was decidedly not the case with Haskell's C backend, which is one of the reasons the C backend underperformed.

Another point against using a C backend is that C's semantics are actually not as simple as they look. If your language differs significantly from C, using a C backend means you will have to deal with all those unpleasant subtleties, and possibly with differences between C compilers. It may be easier to use LLVM, with its simpler semantics, or to develop your own backend, than to keep track of all that.

+6
Jan 23 '12 at 18:53

Personally, I would compile to C. That way you have a universal intermediate language and you don't have to worry about whether your compiler supports every platform. Using LLVM might yield some performance gains (though I would argue the same could probably be achieved by tuning your generated C to be more optimizable), but it locks you into supporting only LLVM's targets, and into waiting for LLVM to add a target whenever you want to support something new, old, different, or obscure.

+3
Jan 23 '12 at 18:16

As far as I know, C can neither query nor manipulate the processor's flags.

+2
Feb 02 '12 at 16:27

This answer is a rebuttal of some of the points made against C as a target language.

  • Tail call optimization

    Any function that can be tail-call optimized is effectively equivalent to iteration (an "iterative process" in SICP terminology). Moreover, many recursive functions can, and for performance reasons should, be made tail recursive using accumulators, etc.

    Thus, for your language to guarantee tail call optimization, you will have to detect such calls and simply not map those functions onto ordinary C functions, but compile them into loops instead.

  • Garbage collection

    It can in fact be implemented in C. You can build a runtime system for your language consisting of some basic abstractions over the C memory model: your own memory allocators, constructors, special pointers for source-language objects, and so on.

    For example, instead of using ordinary C pointers for source-language objects, you can introduce a special structure on top of which the garbage collection algorithm operates. Objects in your language (references, to be precise) can then behave as they do in Java, while on the C side they are represented together with meta-information (which you would not have if you worked with plain pointers).

    Of course, such a system may have problems integrating with the existing C ecosystem; that depends on your implementation and the trade-offs you are prepared to make.

  • Unavailable operations

    hippietrail noted that C lacks the rotation operators (by which I assume he means circular shifts) that processors support. If such operations are available in the instruction set, they can be added using inline assembly.

    The compiler, in that case, would have to detect the architecture it is targeting and emit the appropriate snippets. There should also be a fallback in the form of an ordinary function.

This answer seems to take the fundamental issues seriously. I would like to see more justification for why these problems are caused specifically by C's semantics.

+2
Dec 12 '14 at 22:43

One particular case: writing a programming language with strong security* or reliability requirements.

First, it would take you years of work to pin down a subset of C precise enough that you know every C construct your compiler emits is safe and cannot invoke undefined behavior. Second, you would then need to find a C implementation you can trust (which implies a tiny trusted code base, and probably not a very efficient one). Not to mention that you also need a trustworthy linker, an OS capable of executing compiled C code, and some basic libraries, all of which must be well-defined and trusted.

So in this case you might as well use assembly language or, if you need machine independence, some kind of intermediate representation.

* note that "strong security" here has nothing to do with what banks and IT companies claim to have

+1
Mar 29 '13 at 17:03


