Proper behavior for concatenating string literals (C++11 translation phase 6)

Question

I'm pretty sure that Visual C++ 2015 has a bug here, but I'm not 100% sure.

The code:

    // Encoding: UTF-8 with BOM (required by Visual C++).
    #include <stdlib.h>

    auto main() -> int
    {
        auto const s = L"" "𐐷 is not in the Unicode BMP!";
        return s[0] > 256? EXIT_SUCCESS : EXIT_FAILURE;
    }

Result with g++:

    [H:\scratchpad\simple_text_io]
    > g++ --version | find "++"
    g++ (i686-win32-dwarf-rev1, Built by MinGW-W64 project) 6.2.0

    [H:\scratchpad\simple_text_io]
    > g++ compiler_bug_demo.cpp

    [H:\scratchpad\simple_text_io]
    > run a
    Process exit code = 0.

    [H:\scratchpad\simple_text_io]
    > _

Result with Visual C++:

    [H:\scratchpad\simple_text_io]
    > cl /nologo- 2>&1 | find "++"
    Microsoft (R) C/C++ Optimizing Compiler Version 19.00.23026 for x86

    [H:\scratchpad\simple_text_io]
    > cl compiler_bug_demo.cpp /Feb
    compiler_bug_demo.cpp
    compiler_bug_demo.cpp(8): warning C4566: character represented by universal-character-name '\U00010437' cannot be represented in the current code page (1252)

    [H:\scratchpad\simple_text_io]
    > run b
    Process exit code = 1.

    [H:\scratchpad\simple_text_io]
    > _

Is there any UB, and if not, which compiler behaves correctly?

Addendum:

The behavior of both compilers is unchanged if the string instead starts with the lowercase Greek pi, "π", which is in the BMP, so whether the character lies outside the BMP does not matter.

c++ visual-c++ g++

Cheers and hth. - alf Jan 4 '17 at 9:35

1 answer

Revolver_Ocelot · Answer 1 · 2017-01-04T09:54:51+0000

From [lex.string]:

In translation phase 6, adjacent string literals are concatenated. If both string literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string literal has no encoding-prefix, it is treated as a string literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior. [Note: This concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a literal has been translated into a value from the appropriate character set), a string literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. -end note] Table 8 has some examples of valid concatenations.
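The prefix rule quoted above can be checked at compile time. The following is a minimal sketch, assuming a C++11 compiler; `literal_size` is a hypothetical helper introduced here for illustration and is not part of the question or the standard.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper (not from the question): element count of a string
// literal's array, including the terminating null character.
template <class Char, std::size_t N>
constexpr std::size_t literal_size(const Char (&)[N]) { return N; }

// An unprefixed literal next to a wide literal is treated as wide,
// whichever side it is on.
static_assert(std::is_same<decltype(L"" "abc"), const wchar_t (&)[4]>::value,
              "wide + unprefixed is wide");
static_assert(std::is_same<decltype("ab" L"c"), const wchar_t (&)[4]>::value,
              "unprefixed + wide is wide");

// The same rule applies to the other prefixes, for example u"".
static_assert(std::is_same<decltype(u"" "ab"), const char16_t (&)[3]>::value,
              "char16_t + unprefixed is char16_t");

// Two unprefixed literals stay narrow: 'a', 'b', and the terminator.
static_assert(literal_size("a" "b") == 3, "narrow + narrow is narrow");
```

Because concatenation happens in phase 6, before the literal is analyzed as an expression, `decltype` only ever sees the single combined literal. These checks cover the type of the result only; they say nothing about which code unit values end up in it.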

So there is no UB. However, translation phase 5 may already have changed the value of some characters:

Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
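To see what phase 5 produces on a given compiler, one can inspect the code units of a wide literal directly. This is a rough sketch, not a conformance test: the names `direct`, `spliced`, and `units` are made up for illustration, and the surrogate-pair expectation in the comments is an assumption about mainstream implementations (UTF-16 when `wchar_t` is 16 bits, UTF-32 when it is 32 bits), not something the standard guarantees.

```cpp
#include <cstddef>

// U+10437 written as a universal-character-name, so the result does not
// depend on the encoding of the source file.
constexpr wchar_t direct[]  = L"\U00010437";     // escape inside a wide literal
constexpr wchar_t spliced[] = L"" "\U00010437";  // escape inside the unprefixed part

// Hypothetical helper: number of code units, excluding the terminator.
template <std::size_t N>
constexpr std::size_t units(const wchar_t (&)[N]) { return N - 1; }

// Assumption: with a 16-bit wchar_t the character is typically stored as the
// surrogate pair 0xD801 0xDC37; with a 32-bit wchar_t as the single unit 0x10437.
constexpr bool direct_is_one_unit = units(direct) == 1;

// If a compiler converts the unprefixed part to a narrow execution character
// set in phase 5, before the pieces are joined in phase 6, then spliced[0]
// may differ from direct[0]. That is the situation the question describes.
constexpr bool same_first_unit = spliced[0] == direct[0];
```

Printing `same_first_unit` with each compiler is one way to compare them on this point. The value is deliberately not asserted here, since it is exactly the behavior that differs between the two compilers in the question.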
