Why can printf display non-ASCII characters when using the "C" locale?

Note: I am asking about implementation-defined behavior in Microsoft Visual C++ 2008 (possibly the same in 2005+). OS: Simplified Chinese installation of Windows 7.

I was surprised by what happens when I do non-ASCII I/O with printf. For instance:

  // This won't be necessary as it is the system default code page.
  //system("chcp 936");
  // NULL to show current locale, which is "C"
  printf("%s\n", setlocale(LC_ALL, NULL));
  printf("中\n");
  printf("%s\n", setlocale(LC_ALL, "English"));
  printf("中\n");

Output:

 Active code page: 936
 C
 中
 English_United States.1252
 ?D

Inspecting memory in the debugger shows that "中" is encoded in two bytes, 0xD6 and 0xD0, which is the code for this character in code page 936 (Simplified Chinese). These bytes should not be within the character range of the "C" locale, which is most likely 0x0 ~ 0x7F.
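To make the byte-level observation concrete, here is a minimal sketch that dumps the bytes of a string and checks whether they all fall in 0x00 ~ 0x7F. The helper names is_pure_ascii and hex_bytes are hypothetical, and the string is written with explicit escapes so the result does not depend on how the source file is encoded.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if every byte of s is in the 7-bit range 0x00-0x7F, else 0. */
int is_pure_ascii(const char *s)
{
    for (; *s != '\0'; ++s) {
        if ((unsigned char)*s > 0x7F)
            return 0;
    }
    return 1;
}

/* Writes the bytes of s into out as space-separated uppercase hex.
   Stops early if out is too small for the next byte. */
void hex_bytes(const char *s, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return;
    out[0] = '\0';
    /* Each byte needs at most 3 characters (" D6") plus the terminator. */
    for (; *s != '\0' && used + 4 <= out_size; ++s) {
        used += (size_t)sprintf(out + used, used ? " %02X" : "%02X",
                                (unsigned)(unsigned char)*s);
    }
}
```

Calling hex_bytes("\xD6\xD0", buf, sizeof buf) fills buf with "D6 D0", and is_pure_ascii returns 0 for the same string, so these are bytes that printf passes along without them being valid 7-bit characters.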

Question:

Why can it correctly display the character in the "C" locale? I assumed this meant the locale has nothing to do with printf. But then why can the character no longer be displayed after switching to the "English" locale, which also differs from 936?

Edit:

I redirected standard output to a file and ran some tests. No matter which locale is set, the correct "中" character is saved in the file. This suggests that setlocale() affects the way the console displays the character, which contradicts my understanding of how it works: printf puts bytes into the console's input buffer, and the console interprets these bytes using its own code page (the one reported by chcp).

+4
3 answers

For the default "C" locale, the CRT assumes that the characters passed to printf do not need any conversion. There is a reason for this: ASCII characters almost always fall into the basic character set of the runtime system (common to the different Windows code pages). When you switch to "English", the CRT assumes that the input is encoded in code page 1252 and therefore tries to convert it from "English" to "Chinese", the locale used by the console. But the CRT cannot find a match for the character via code page 1252, which is why it displays a question mark.
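As a rough illustration of what "assuming the input is code page 1252" means for these two bytes, the sketch below maps a code page 1252 byte to its Unicode code point. It relies only on the fact that code page 1252 agrees with ISO-8859-1 (and so with the first 256 Unicode code points) outside 0x80 ~ 0x9F. The function name is hypothetical, and whether the CRT performs exactly this step internally is this answer's assumption, not something the sketch proves.

```c
/* Maps a code page 1252 byte to its Unicode code point.
   Covers 0x00-0x7F (ASCII) and 0xA0-0xFF (same as ISO-8859-1).
   Returns -1 for 0x80-0x9F, which this sketch does not handle. */
long cp1252_to_unicode(unsigned char b)
{
    if (b < 0x80 || b >= 0xA0)
        return (long)b;
    return -1;
}
```

Read this way, 0xD6 becomes U+00D6 ("Ö") and 0xD0 becomes U+00D0 ("Ð"): two Latin letters rather than one Chinese character, which is consistent with the console showing two unrelated glyphs ("?D") instead of "中".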

When output is redirected to a file, the CRT is aware of this and does not perform the conversion, because the console code page is no longer involved. It just passes the bytes through as is. How these bytes are interpreted depends on the program you use to open the file (for example, whether or not it takes the encoding into account).

See this MSDN forum page: Why can printf display non-ASCII characters when using the "C" locale?

0

936 is a rather complex code page: it allows double-byte characters (somewhat like the multi-byte sequences of UTF-8). Cyrillic (866), for example, does not allow double-byte characters, and its behavior would be the same as "English".

Therefore, when you use the default code page (936), it knows how to handle a two-byte character, while "English" deals only with 0x0 ~ 0x7f.

Let me also answer why wprintf(L"中") fails. There is a big difference between a console application and a Windows GUI application: they use different code pages. Here is how the console (DOS) and Windows code pages correspond:
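The double-byte handling can be sketched as follows. This is a simplified model that assumes any byte in 0x81 ~ 0xFE starts a two-byte character in code page 936; the function names are hypothetical, and on Windows the real query would be IsDBCSLeadByteEx(936, b).

```c
#include <stddef.h>

/* Simplified lead-byte test for code page 936 (GBK):
   bytes 0x81-0xFE are treated as the first byte of a two-byte character. */
int is_cp936_lead_byte(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Counts characters (not bytes) in a NUL-terminated code page 936 string. */
size_t cp936_char_count(const char *s)
{
    size_t count = 0;

    while (*s != '\0') {
        if (is_cp936_lead_byte((unsigned char)*s) && s[1] != '\0')
            s += 2;   /* lead byte + trail byte form one character */
        else
            s += 1;   /* single-byte character, or a dangling lead byte */
        ++count;
    }
    return count;
}
```

For the three bytes 0xD6 0xD0 0x0A ("中" followed by a newline), cp936_char_count returns 2, whereas a single-byte code page would treat the same input as three separate characters.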

 DOS   | Windows
 ------+----------
 850   | 1252
 936   | 54936
 866   | 1251

So, if you want to see the correct characters in the console, first convert the string with WideCharToMultiByte, which performs the conversion to code page 936 that the console expects.

+3

The fact that the "C" locale outputs the string exactly as given is not surprising. This is what I would expect. What is surprising is that "English" does something else.

According to the locale documentation on MSDN, the only effect the locale should have on printf is to determine the radix character (i.e. the decimal point) for numeric values.

I suspect this may be a bug in the Microsoft compiler. Or at least it is undocumented behavior.

For what it's worth, with my compiler (Borland) the locale does not affect the output of these strings. It does, however, affect the radix character.
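The documented effect is easy to check with standard C. The sketch below (the function name is hypothetical) switches LC_NUMERIC to a named locale, reads the radix character that printf would use from localeconv(), and then restores the previous locale.

```c
#include <locale.h>
#include <string.h>

/* Returns the radix (decimal point) character of the named locale,
   or '\0' if that locale is not available on this system. */
char radix_char_for(const char *locale_name)
{
    char saved[128];
    char radix;
    const char *current = setlocale(LC_NUMERIC, NULL);

    if (current == NULL || strlen(current) >= sizeof saved)
        return '\0';
    strcpy(saved, current);              /* remember the caller's locale */

    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return '\0';                     /* locale name not known here */

    radix = localeconv()->decimal_point[0];
    setlocale(LC_NUMERIC, saved);        /* restore the previous locale */
    return radix;
}
```

radix_char_for("C") returns '.'. Locale names other than "C" are platform-specific, so a call such as radix_char_for("German") is only an assumption about what a given runtime accepts; where it succeeds, it would typically return ','.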

+3

Source: https://habr.com/ru/post/1479223/

