There is also a longer escape form: \U followed by eight digits, rather than \u followed by four. It is supported in Java and Python as well, among others:
>>> '\xf0\x90\x84\x82'.decode("UTF-8")
u'\U00010102'
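For reference, those bytes can be derived from the code point by hand. Here is a minimal sketch of the four-byte UTF-8 encoding rule, shown only to illustrate where escapes like \xF0\x90\x84\x82 come from:

#include <cstdio>

// Encode a code point in the range U+10000..U+10FFFF as four UTF-8 bytes.
int main() {
    unsigned long cp = 0x10102;                        // U+10102
    unsigned char b[4];
    b[0] = 0xF0 | (unsigned char)(cp >> 18);           // 11110xxx
    b[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);  // 10xxxxxx
    b[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);   // 10xxxxxx
    b[3] = 0x80 | (unsigned char)(cp & 0x3F);          // 10xxxxxx
    std::printf("\\x%02X\\x%02X\\x%02X\\x%02X\n", b[0], b[1], b[2], b[3]);
    // Prints \xF0\x90\x84\x82, the escapes used in the byte-string literal below.
}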
However, if you are using byte strings anyway, why not simply escape every byte as shown above instead of relying on the compiler to convert the escapes into a UTF-8 string? Letting the compiler do the conversion may seem more portable, but watch what happens when I compile the following program:
#include <iostream>
#include <string>

int main() {
    std::cout << "narrow: " << std::string("\uFF0E").length()
              << " utf8: " << std::string("\xEF\xBC\x8E").length()
              << " wide: " << std::wstring(L"\uFF0E").length() << std::endl;
    std::cout << "narrow: " << std::string("\U00010102").length()
              << " utf8: " << std::string("\xF0\x90\x84\x82").length()
              << " wide: " << std::wstring(L"\U00010102").length() << std::endl;
}
On Win32 with my current settings, cl gives:
warning C4566: character represented by universal-character-name '\UD800DD02' cannot be represented in the current code page (932)
The compiler tries to convert all Unicode escape sequences in byte strings to the system code page, which, unlike UTF-8, cannot represent all Unicode characters. Oddly enough, it understood that \U00010102 is \uD800\uDD02 in UTF-16 (its internal representation) and then mangled that escape in the warning message...
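The surrogate pair quoted in that warning follows from the standard UTF-16 formula; a minimal sketch of the arithmetic:

#include <cstdio>

// Split a supplementary code point into a UTF-16 surrogate pair.
int main() {
    unsigned long cp = 0x10102;                              // U+10102
    unsigned long v  = cp - 0x10000;                         // 20-bit value
    unsigned int high = 0xD800 + (unsigned int)(v >> 10);    // high surrogate
    unsigned int low  = 0xDC00 + (unsigned int)(v & 0x3FF);  // low surrogate
    std::printf("\\u%04X\\u%04X\n", high, low);              // prints \uD800\uDD02
}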
When run, the program prints:
narrow: 2 utf8: 3 wide: 1
narrow: 2 utf8: 4 wide: 2
Note that the UTF-8 byte strings and the wide strings are correct, but the compiler could not convert "\U00010102" and emitted the byte string "??" instead, which is the wrong result.
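If you want to see exactly which bytes the compiler produced, you can dump the strings in hex, along these lines (the "3F 3F" result assumes the same compiler and code page 932 as above):

#include <cstdio>
#include <string>

// Print each byte of a narrow string in hex.
static void dump(const char* label, const std::string& s) {
    std::printf("%s:", label);
    for (std::string::size_type i = 0; i < s.size(); ++i)
        std::printf(" %02X", (unsigned char)s[i]);
    std::printf("\n");
}

int main() {
    dump("narrow", std::string("\U00010102"));       // on the system above: 3F 3F, i.e. "??"
    dump("utf8", std::string("\xF0\x90\x84\x82"));   // F0 90 84 82, the intended UTF-8 bytes
}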