Which Unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types?

There are different encodings of the same standardized Unicode table. For example, in UTF-8 the character A is encoded as 0x41, but in a UTF-16 file the same A showed up as 0xFEFF 0x0041.
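To make the observation concrete, here is a minimal C++ sketch (standard C++11, nothing Windows-specific) that prints the code units for A in both encodings:

    #include <cstdio>

    int main() {
        const char     utf8[]  = "A";  // in UTF-8 (and ASCII) A is the single byte 0x41
        const char16_t utf16[] = u"A"; // one 16-bit UTF-16 code unit, 0x0041

        std::printf("UTF-8 : 0x%02X\n", (unsigned)(unsigned char)utf8[0]);
        std::printf("UTF-16: 0x%04X\n", (unsigned)utf16[0]);
        return 0;
    }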

From this brilliant article, I learned that when I program in C++ for Windows and deal with Unicode, I should know that it is represented in 2 bytes. But it says nothing about the encoding. (It does say that x86 processors are little-endian, so I know how those two bytes are ordered in memory.) But I also need to know the Unicode encoding to have full information about how characters are stored in memory. Is there a fixed Unicode encoding for C++/Windows programmers?

1 answer

The values stored in memory on Windows are always UTF-16 little-endian. But that's not what you're dealing with here: you're looking at the contents of a file. Windows itself does not specify the encoding of files; it leaves that to individual applications.
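A minimal sketch of that claim (assuming a Windows compiler, where wchar_t is 2 bytes): dump the raw bytes of a wide string and you see the little-endian UTF-16 layout.

    #include <cstdio>

    int main() {
        const wchar_t text[] = L"A";  // UTF-16 on Windows
        std::printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t)); // 2 on Windows

        // On little-endian x86 this prints "41 00": low byte first.
        const unsigned char* bytes = (const unsigned char*)text;
        for (unsigned i = 0; i < sizeof(wchar_t); ++i)
            std::printf("%02X ", (unsigned)bytes[i]);
        std::printf("\n");
        return 0;
    }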

The 0xFE 0xFF that you see at the beginning of the file is a byte order mark (BOM). It not only indicates that the file is most likely Unicode, it also tells you which Unicode encoding variant was used:

    0xFE 0xFF        UTF-16 big-endian
    0xFF 0xFE        UTF-16 little-endian
    0xEF 0xBB 0xBF   UTF-8
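A sketch of how an application can check for these signatures when it opens a file (the function name and the sample buffer are mine, for illustration):

    #include <cstddef>
    #include <cstdio>

    const char* DetectBomEncoding(const unsigned char* buf, std::size_t len) {
        if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
            return "UTF-8";
        if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
            return "UTF-16 big-endian";
        if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
            return "UTF-16 little-endian";
        return "no BOM - assume an 8-bit encoding and guess";
    }

    int main() {
        const unsigned char sample[] = { 0xFF, 0xFE, 0x41, 0x00 }; // "A" in UTF-16 LE with BOM
        std::printf("%s\n", DetectBomEncoding(sample, sizeof(sample)));
        return 0;
    }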

A file that has no BOM should be assumed to contain 8-bit characters unless you know how it was written. Even then that doesn't tell you whether it is UTF-8 or some other Windows character encoding; you just have to guess. One common guess is sketched below.
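This sketch assumes Windows Vista or later: ask MultiByteToWideChar to validate the bytes as UTF-8 with the MB_ERR_INVALID_CHARS flag, and fall back to the system ANSI code page if that fails. The helper name is mine, and validity as UTF-8 is still only a guess, since plenty of ANSI text also happens to be valid UTF-8.

    #include <windows.h>

    // Returns CP_UTF8 if the buffer decodes cleanly as UTF-8, otherwise CP_ACP.
    UINT GuessCodePage(const char* data, int len) {
        int units = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                        data, len, nullptr, 0);
        return (units != 0) ? CP_UTF8 : CP_ACP;
    }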

You can use Notepad as an example of how this is done. If the file has a BOM, Notepad reads it and interprets the contents accordingly. Otherwise you must specify the encoding yourself with the "Encoding" drop-down list.

Edit: The reason the Windows documentation is not more specific about the encoding is that Windows was a very early adopter of Unicode, back when there was only one encoding, with a single 16-bit value per code point (now known as UCS-2). When 65,536 code points proved inadequate, surrogate pairs were invented as a way to extend the range, and UTF-16 was born. Microsoft was already using "Unicode" to refer to their encoding and never changed the name.
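To see what surrogate pairs actually do, here is a sketch of the standard UTF-16 encoding arithmetic for a code point above U+FFFF (the example code point is mine):

    #include <cstdio>

    int main() {
        unsigned cp = 0x1F600;               // a code point outside the original 16-bit range
        unsigned v  = cp - 0x10000;          // 20 bits to split across two code units
        unsigned hi = 0xD800 + (v >> 10);    // high (lead) surrogate
        unsigned lo = 0xDC00 + (v & 0x3FF);  // low (trail) surrogate
        std::printf("U+%X -> 0x%04X 0x%04X\n", cp, hi, lo); // U+1F600 -> 0xD83D 0xDE00
        return 0;
    }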


Source: https://habr.com/ru/post/1447571/

