Unicode literal - how does that make sense?

Question


 #include <iostream>

 int main() {
     std::cout << "\u2654" << std::endl;   // Result #1: ♔
     std::cout << U'\u2654' << std::endl;  // Result #2: 9812
     std::cout << U'♔' << std::endl;       // Result #3: 9812
     return 0;
 }

I find it hard to understand how Unicode works with C++. Why doesn't a character literal get printed as a character in the terminal?

I want something like this to work:

 char32_t txt_representation() { return /* Unicode codepoint */; }

Note: the source file is UTF-8, and so is the terminal. This is on macOS Sierra with CLion.

+1

c++ unicode character-encoding

Entalpi Dec 12 '16 at 18:13


4 answers

C++ does not have the concept of a “character” in its type system. char, wchar_t, char16_t and char32_t are all considered integer types. As a result, character literals such as 'x', L'x' and U'x' are all numbers. There is an operator<< for char, so

 cout << "endl is almost never necessary" << '\n';

does the same thing as

 cout << "endl is almost never necessary\n";

but there are no analogues for the *char_t types, so your wide character literals are silently converted to int and printed as such. I personally never use iostreams, so I don't really know how to convince operator<< to print the number as a Unicode code point, but there is probably a way to do it.
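As a rough illustration of the point above, the sketch below makes the integer conversion explicit. The helper names as_number and as_char are hypothetical, not part of any library. The explicit cast also keeps the code compiling under C++20, where streaming a char32_t to a narrow stream is no longer allowed.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: formats a char32_t the way a narrow stream sees it,
// i.e. as a plain unsigned integer.
std::string as_number(char32_t c) {
    std::ostringstream out;
    out << static_cast<std::uint32_t>(c);  // explicit: this is just a number
    return out.str();
}

// Hypothetical helper: char has its own operator<<, so it prints as a character.
std::string as_char(char c) {
    std::ostringstream out;
    out << c;
    return out.str();
}
```

With these helpers, as_number(U'\u2654') gives "9812", matching results #2 and #3 in the question, while as_char('x') gives "x".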

There is a stronger distinction between a "string" and an "array of integers" in the type system, so you get the expected result when you supply a string literal. Note, however, that cout << L"♔" will not produce the expected result, and cout << "♔" is not even guaranteed to compile. cout << u8"♔" will work on a C++11-compatible system where the narrow character encoding is actually UTF-8, but will probably produce mojibake if the narrow encoding is something else.

(Yes, this is a lot more complicated and less useful than it has any excuse to be. This is partly due to backward compatibility constraints inherited from C, partly because it was all designed back in the 1990s before Unicode took over the world, and partly because many of the design errors in C++'s string and stream classes were not obviously errors until it was too late to fix them.)

+7

zwol Dec 12 '16 at 18:25


Printing wide characters to narrow streams is not supported and does not work at all. (It "works", but the result is not what you want.)

Printing multibyte narrow strings to wide streams is not supported and does not work at all. (It "works", but the result is not what you want.)

On a Unicode-enabled system, std::cout << "\u2654" works as expected. So does std::cout << u8"\u2654". Most properly configured Unix-based operating systems are Unicode-ready.

On a Unicode-enabled system, std::wcout << L'\u2654' should work if you have set up your program's locale correctly. This is done with this call:

  ::setlocale(LC_ALL, "");

or

  ::std::locale::global(::std::locale(""));

Note the "should": with some compilers/libraries this method may not work at all. That is a deficiency in those compilers/libraries. I am looking at you, libc++. It may or may not formally be a bug, but I consider it one.

You really should set up your locale in every program that wants to work with Unicode, even if it does not seem necessary.
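A minimal sketch of the locale setup described above, wrapped in a hypothetical helper named use_environment_locale. It assumes that std::locale("") may throw std::runtime_error when the environment names a locale the system does not have, in which case the program stays in the default "C" locale.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: switch the global C++ locale to whatever the user's
// environment requests (LANG, LC_ALL, ...). Returns false if that locale
// cannot be constructed, leaving the global locale unchanged.
bool use_environment_locale() {
    try {
        std::locale::global(std::locale(""));
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling this once at the start of main, before any output, is the usual pattern. Whether wide output then works still depends on the standard library, as noted above.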

Mixing cout and wcout in one program does not work and is not supported.

std::wcout << U'\u2654' does not work because it mixes a wchar_t stream with a char32_t character. wchar_t and char32_t are different types. I presume that a properly configured std::basic_ostream<char32_t> would work with char32_t strings, but the standard library does not provide one.

char32_t-based strings are good for storing and processing Unicode code points. Do not use them for formatted input and output directly. std::wstring_convert can be used to convert them back and forth.

TL;DR: work with std::streams and std::strings, or (if you are not on libc++) std::wstreams and std::wstrings.

+2

nm Dec 12 '16 at 19:07


On my system, I cannot mix std::cout with std::wcout and get sensible results, so you have to use them separately.

You should set the locale to match the host system using std::locale::global(std::locale("")); .

Also use wide streams for the last two outputs.

Either:

 std::locale::global(std::locale(""));
 std::cout << "\u2654" << std::endl;

Or:

 std::locale::global(std::locale(""));
 std::wcout << L"\u2654" << std::endl;
 std::wcout << L'♔' << std::endl;

This should make the output streams convert between the local system encoding and UTF-8 (first example) or UTF-16/UTF-32 (second example).

For the first example it may be safer to use the u8 string prefix, since editors may use other encodings:

 std::cout << u8"\u2654" << std::endl;

+1

Galik Dec 12 '16 at 19:06


Christophe · Accepted Answer · 2016-12-12T19:33:53+0000

Unicode and C++

There are several Unicode encodings:

UTF-8 encodes each Unicode character as a sequence of one to four 8-bit bytes ( char ).
UTF-16 (which may be BE or LE depending on endianness) encodes each Unicode character as a sequence of one or two 16-bit words ( char16_t ).
UTF-32 (again BE or LE) encodes each Unicode character as a single 32-bit word ( char32_t ).
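To make the UTF-8 case concrete, here is a hand-written sketch that encodes a single code point as one to four bytes. The function name to_utf8 is hypothetical and this is not the standard library's conversion facility. It assumes the result will be written to a narrow stream on a UTF-8 terminal.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are invalid
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+2654 this produces the three bytes 0xE2 0x99 0x94, so std::cout << to_utf8(U'\u2654') should show the chess king on a UTF-8 terminal. A function like the asker's txt_representation() could keep returning char32_t and be converted this way only at output time.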

Here's a great video tutorial on Unicode in C++ by James McNellis. It explains everything you need to know about character sets, Unicode and its different encodings, and how to use them in C++.

Your code

"\u2654" is a narrow string literal, which is an array of char. The white chess king character is encoded as 3 consecutive chars corresponding to its UTF-8 encoding ( { 0xe2, 0x99, 0x94 } ). Since we are in a string, having several chars in it is not a problem. Since UTF-8 is evidently used in your console locale, the console decodes the sequence correctly when the string is displayed.

U'\u2654' is a character literal of type char32_t (because of the uppercase U). Since it is a char32_t (not a char), it is displayed not as a char but as an integer value. The decimal value is 9812. If you printed it in hex, you would recognize it immediately (0x2654).

The last one, U'♔', follows the same logic. Keep in mind, however, that you are embedding a Unicode character in the source code. This is fine as long as the character encoding of the editor matches the source encoding expected by the compiler. But it can lead to mismatches if the file is copied (without conversion) to environments that expect a different encoding.
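The hex remark can be checked with a short sketch. The helper name code_point_label is hypothetical. It formats the value in the conventional U+XXXX notation, padded to at least four hex digits.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format a code point as "U+XXXX" using uppercase hex.
std::string code_point_label(char32_t c) {
    std::ostringstream out;
    out << "U+" << std::hex << std::uppercase
        << std::setw(4) << std::setfill('0')
        << static_cast<std::uint32_t>(c);  // cast: print as a number, in hex
    return out.str();
}
```

Here code_point_label(U'\u2654') gives "U+2654", the same value that prints as 9812 in decimal.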
