If a compiler's internals are shared across many language front ends, will the compiled object code be the same for different languages?

I know that a compiler can have many front ends. Each front end converts code written in a particular programming language into an internal data structure (an intermediate representation).

The compiler then performs optimizations on this data structure.

Finally, the back end of the compiler converts this data structure into assembly code, and at the assembly stage the assembler converts that into object code.

My question is the following.

Given that every programming language is translated into this internal data structure, is the final code emitted by the compiler the same for the same program logic written in DIFFERENT programming languages?

2 answers

Yes, it is possible, but subtle differences between languages can make the output differ even for similar-looking source. It is rare for two front ends to feed exactly the same input to the back end. Still, a compiler may canonicalize simple functions and will generally apply the same strategies to the same constructs (for example, on x86, deciding how many LEA instructions to use instead of a multiply).

e.g. in C, signed integer overflow is undefined behavior, so

 void foo(int *p, int n) {
     for (int i = 0; i <= n; i++) {
         p[i] = i/4;
     }
 }

can be assumed to eventually terminate for all possible n (including INT_MAX ), with i never negative.

With a front end for a language in which i++ is defined to have 2's complement wraparound (or with gcc -fwrapv -fno-strict-overflow ), i would wrap from ==INT_MAX to a large negative value, always remaining <= INT_MAX . The compiler would have to emit asm that correctly implements the source behavior even for callers who pass n == INT_MAX , making the loop infinite, with i able to go negative.

But since signed overflow is undefined behavior in C and C++, the compiler can assume the program contains no UB, and thus that no caller passes INT_MAX . It can assume that i is never negative inside the loop and that the loop counter fits in an int . See also What Every C Programmer Should Know About Undefined Behavior (clang blog).


The non-negative assumption lets the compiler implement i / 4 with a simple right shift, instead of applying C's integer-division semantics (truncation toward zero) for negative numbers.

 # the p[i] = i/4; part of the inner loop, from
 # gcc -O3 -fno-tree-vectorize
 mov     edx, eax                       # copy the loop counter
 sar     edx, 2                         # i / 4 == i>>2
 mov     DWORD PTR [rdi+rax*4], edx     # store into the array

Source + asm output on the Godbolt compiler explorer.

But if signed wraparound is defined behavior, signed division by a constant takes more instructions, and the array indexing must account for possible wraparound:

 # Again *just* the body of the inner loop, without the loop overhead
 # gcc -fno-strict-overflow -fwrapv -O3 -fno-tree-vectorize
 test    eax, eax            # set flags (including SF) according to i
 lea     edx, [rax+3]        # edx = i+3
 movsx   rcx, eax            # sign-extend for use in the addressing mode
 cmovns  edx, eax            # copy if !signbit_set(i)
 sar     edx, 2              # i/4 = i>=0 ? i>>2 : (i+3)>>2;
 mov     DWORD PTR [rdi+rcx*4], edx

Array-indexing syntax is just sugar for pointer + integer and does not require the index to be non-negative. So it is legal for a caller to pass a pointer into the middle of a 4 GiB array, which this function would eventually wrap around into and write. (Infinite loops are also questionable, but never mind that here.)

As you can see, this tiny difference in language rules forced the compiler to give up an optimization. Differences between whole languages are usually larger than the difference between ISO C++ and the wraparound-signed flavor of C++ that g++ can implement.

In addition, if the "default" types have a different width or signedness in another language, equivalent-looking source code will very likely mean something different, and in some cases that will affect the generated code.

If I had used unsigned , wraparound would be well-defined overflow behavior in C and C++. But unsigned types are non-negative by definition, so defined wraparound would not have such an obvious effect on the optimizations here. If the loop had started from zero, wraparound would make it possible to get back to 0, which could matter (for example, if x / i appears: division by zero).


Yes, it is possible for code compiled from different languages to result in the same final assembly.

Same or similar code

For example, if the front ends for two different languages produce the same intermediate code and metadata¹, and the same optimization passes are applied, you are essentially guaranteed the same output. This is easiest to see with closely related languages such as C and C++, where the same or similar source often produces identical code.

Here is a trivial example using C code that increments through a pointer and C++ code that increments through a reference.

Increment in C

Source

 void inc(int* p) {
     (*p)++;
 }

Assembly

With gcc at -O2

 inc:
         add     DWORD PTR [rdi], 1
         ret

Play with the assembly on Godbolt with gcc and clang.

Increment in C++

Similar code, but using a C++ reference instead of passing a pointer.

Source

 void inc(int& p) {
     p++;
 }

Assembly

With g++ at -O2

 inc(int&):
         add     DWORD PTR [rdi], 1
         ret

Play with it on Godbolt.

The assembly produced in both cases was identical, despite the use of different languages and different language features (references in the C++ case, which are not available in C).

Note that clang, a completely separate toolchain with a code base different from gcc's, produces different code (using inc rather than add ), but its output is likewise consistent between the C and C++ versions.

Different code

More interestingly, even wildly different code in different languages can produce the same final assembly. Even if the front ends generate very different intermediate code, the optimization passes can ultimately reduce both inputs to the same output. Of course, this is not guaranteed for any particular input, and it varies greatly by compiler and platform.


¹ By metadata, I mean everything other than the intermediate instructions themselves that can affect code generation. For example, some languages' semantics may permit fewer optimizations, such as restrictions on memory reordering or other behaviors (Peter's answer points out signedness rules). It is not clear to me whether all such differences are encoded directly in the intermediate language, or whether there is also metadata attached to each unit of intermediate code describing semantics that the optimization and back-end phases must respect.



