Why can a Windows console with a Chinese code page show a UTF-16 encoded character?

Per MSDN:

"For the Microsoft C/C++ compiler, the source and execution character sets are both ASCII."

C++03

2.1 Translation Phases

"... Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)"

2.13.2 Character Literals

"A universal-character-name is translated to the encoding, in the execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding."

To check which execution character set MSVC++ uses, I wrote the following code:

    #include <cstdio>

    const wchar_t *str = L"中";
    const unsigned char *p = reinterpret_cast<const unsigned char*>(str);
    // sizeof(L"中") includes the terminating null character
    for (size_t i = 0; i < sizeof(L"中"); ++i) {
        printf("%x ", p[i]);
    }

The output is 2d 4e 0 0, and 0x4E2D is the UTF-16 encoding of this Chinese character (stored low byte first). Therefore, I conclude that UTF-16 is used as the MSVC execution character set for wide literals (my version: 2012 4.5.50709).
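The byte order seen in that dump can be reproduced without relying on the size of wchar_t, which differs between platforms. The sketch below is only an illustration: it uses the C++11 type char16_t in place of wchar_t, and the helper name utf16le_hex is hypothetical, not part of any library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not from the question): lists the bytes of each
// UTF-16 code unit in little-endian order (low byte first), formatted
// the same way as printf("%x ") in the question's loop.
std::string utf16le_hex(const std::u16string& text) {
    std::string out;
    char buf[8];  // enough for "ff ff " plus the terminating null
    for (char16_t unit : text) {
        unsigned lo = unit & 0xFFu;         // low byte comes first in memory
        unsigned hi = (unit >> 8) & 0xFFu;  // high byte comes second
        std::snprintf(buf, sizeof buf, "%x %x ", lo, hi);
        out += buf;
    }
    return out;
}
```

Calling utf16le_hex(u"\u4E2D") yields "2d 4e ", the first two bytes of the dump above; the trailing 0 0 in the question's output is the null terminator that sizeof counts.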

After that I tried to print this character on the Windows console. Since the default locale used by the console is "C", I set the locale to code page 936, which represents Simplified Chinese.

    #include <clocale>
    #include <cwchar>

    // Use the execution environment's locale setting, which is code page 936 here
    const wchar_t *str = L"中";
    char *locale = setlocale(LC_ALL, "");
    wprintf(L"%ls\n", str);

This outputs:

中

I am wondering how a UTF-16 encoded character can be decoded by a Windows console whose locale (decoder) is set to a non-UTF-16 encoding (MS code page 936). How does this work?

+4
2 answers

I think I get it.

In Microsoft C++ 2008 (possibly 2005 and later), CRT facilities such as wprintf and wcout are implemented so that, under the hood, they convert a wide string literal such as L"中", which is encoded in UTF-16, to match the current locale / code page setting. So what happens here is that L"中" is converted to the bytes D6 D0, its encoding in code page 936 for Simplified Chinese.
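The kind of conversion described above can be imitated with the standard wcrtomb function, which encodes a wide character in the current C locale's multibyte encoding. This is a minimal sketch and not the CRT's actual implementation; the helper name narrow_in_current_locale is hypothetical.

```cpp
#include <climits>
#include <clocale>
#include <cwchar>
#include <string>

// Hypothetical helper: converts a wide string to bytes in the encoding of
// the current C locale. Returns false if a character has no encoding there.
bool narrow_in_current_locale(const std::wstring& wide, std::string& out) {
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    char buf[MB_LEN_MAX];
    out.clear();
    for (wchar_t wc : wide) {
        std::size_t n = std::wcrtomb(buf, wc, &state);
        if (n == static_cast<std::size_t>(-1)) {
            return false;  // not representable in the current locale
        }
        out.append(buf, n);
    }
    return true;
}
```

In the default "C" locale, L"中" has no encoding, so the helper reports failure. After a call that selects code page 936, such as setlocale(LC_ALL, "") on the asker's system, the same character would be expected to come out as the two bytes D6 D0 mentioned above.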

I was mistaken in thinking that setlocale sets the console code page. It only sets the program's current code page, which the CRT functions use during the "conversion". To change the console code page, use the chcp command or the Win32 API function SetConsoleOutputCP().

Since my console's default code page is 936, this character is displayed correctly without problems.

+2

"how a UTF-16 encoded character can be decoded by a Windows console whose locale (decoder) is set to a non-UTF-16 encoding"

There are two ways to write text to the console. The byte path, using the Win32 API WriteConsoleA, gives you characters from bytes interpreted using the console code page ("ANSI"). The Unicode path, WriteConsoleW, receives a UTF-16LE string and writes the characters to the console directly, regardless of which code page the console uses.

The stdio printf function uses WriteConsoleA when the output is an interactive console. The wprintf function, at least since VS 2005, calls WriteConsoleW.

0

Source: https://habr.com/ru/post/1478897/

