How to enter 4-byte UTF-8 characters?

I am writing a small application that I need to test with UTF-8 characters of different byte lengths.

I can enter Unicode characters that are encoded in UTF-8 with 1, 2, and 3 bytes just fine, for example:

 string in = "pi = \u03a0";

But how do I get a Unicode character that is encoded with 4 bytes? I tried:

 string in = "aegan check mark = \u10102"; 

Which, as I understand it, should output "𐄂" (U+10102). But when I print it, I get "ᴶ0".

What am I missing?

EDIT:

I got it working by switching to \U and padding with leading zeros:

 string in = "\U00010102"; 

I wish I had thought of this sooner :)

1 answer

There is a longer escape form: \U followed by eight hex digits, as opposed to \u followed by four hex digits. It is also used in other languages, Python among them:

 >>> '\xf0\x90\x84\x82'.decode("UTF-8")
 u'\U00010102'
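To make the digit-count rule concrete, here is a minimal C++ sketch. It assumes the narrow execution character set is UTF-8 (the default for GCC and Clang; MSVC needs the /utf-8 switch), and the function names short_escape and long_escape are made up for illustration.

```cpp
#include <string>

// Assumption: narrow string literals are encoded as UTF-8
// (GCC/Clang default; MSVC with /utf-8).

// \u consumes exactly four hex digits, so this literal is
// U+1010 followed by an ordinary '2' character.
inline std::string short_escape() { return "\u10102"; }

// \U consumes exactly eight hex digits, so this literal is
// the single code point U+10102.
inline std::string long_escape() { return "\U00010102"; }
```

Under that assumption, the first string holds the three bytes of U+1010 (E1 80 90) plus the byte for '2', while the second holds the four-byte sequence F0 90 84 82.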

However, if you are using byte strings, why not just escape each byte as shown above, rather than relying on the compiler to convert the escape to a UTF-8 string? That would seem to be more portable. If I compile the following program:

 #include <iostream>
 #include <string>

 int main() {
     std::cout << "narrow: " << std::string("\uFF0E").length()
               << " utf8: " << std::string("\xEF\xBC\x8E").length()
               << " wide: " << std::wstring(L"\uFF0E").length() << std::endl;

     std::cout << "narrow: " << std::string("\U00010102").length()
               << " utf8: " << std::string("\xF0\x90\x84\x82").length()
               << " wide: " << std::wstring(L"\U00010102").length() << std::endl;
 }

On Win32 with my current settings, cl gives:

warning C4566: character represented by universal-character-name '\UD800DD02' cannot be represented in the current code page (932)

The compiler tries to convert every Unicode escape sequence in a byte string to the system code page, which, unlike UTF-8, cannot represent all Unicode characters. Oddly enough, it has worked out that \U00010102 is \uD800\uDD02 in UTF-16 (its internal Unicode representation) and then garbled the escape in the warning message...
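The surrogate pair in that warning can be checked by hand. The sketch below uses a hypothetical helper, to_surrogates, that applies the standard UTF-16 arithmetic; it assumes its input is a supplementary code point in the range U+10000 to U+10FFFF and does no validation.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a supplementary code point
// (U+10000..U+10FFFF, not checked) into UTF-16 surrogates.
inline std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp) {
    const std::uint32_t v = cp - 0x10000;  // 20-bit offset into the supplementary planes
    const auto high = static_cast<std::uint16_t>(0xD800 + (v >> 10));    // top 10 bits
    const auto low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return {high, low};
}
```

For U+10102 the offset is 0x102, so the high surrogate is 0xD800 and the low surrogate is 0xDC00 + 0x102 = 0xDD02, which matches the two halves squashed together in '\UD800DD02'.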

When run, the program prints:

 narrow: 2 utf8: 3 wide: 1
 narrow: 2 utf8: 4 wide: 2

Note that the UTF-8 byte strings and the wide strings are correct, but the compiler failed to convert "\U00010102" and produced the byte string "??", which is the wrong result.
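If you would rather not write the \x bytes by hand, one option is to build them at run time so that the compiler's code page never gets involved. This is only a sketch: encode_utf8 is a hypothetical helper, and it assumes the argument is a valid Unicode scalar value (at most 0x10FFFF and not a surrogate) without checking it.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point as UTF-8 bytes.
// Assumption: cp is a valid scalar value; no validation is done.
inline std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encode_utf8(0x10102) yields the same four bytes as the "\xF0\x90\x84\x82" literal in the program above, regardless of the code page the compiler uses for narrow literals.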


Source: https://habr.com/ru/post/1277763/

