I guess what I really mean is, "Why would anyone treat UTF-16 as a fixed-width encoding when that is plainly not true?"
Two words: backward compatibility.
Unicode was originally intended to be a fixed-width 16-bit encoding (UCS-2), so early Unicode adopters (like Sun with Java and Microsoft with Windows NT) used a 16-bit character type. When it turned out that 65,536 characters weren't enough for everyone, UTF-16 was designed to let those 16-bit character systems represent the 16 new "planes".
This meant characters were no longer fixed-width, so people rationalized it as "that's fine, because UTF-16 is almost fixed-width."
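As a rough illustration (this is just the standard surrogate-pair arithmetic, not anything specific to any one language), a supplementary-plane code point such as U+10400 gets split across two 16-bit code units:

>>> cp = 0x10400 - 0x10000                # offset into the supplementary planes
>>> hex(0xD800 + (cp >> 10)), hex(0xDC00 + (cp & 0x3FF))   # high and low surrogates
('0xd801', '0xdc00')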
But I'm still not convinced that this is any different from assuming UTF-8 is one byte per character!
Strictly speaking, it isn't. You will get incorrect results for things like
"\uD801\uDC00".lower()
However, assuming UTF-16 is fixed-width is less likely to break things than assuming UTF-8 is fixed-width: non-ASCII characters are very common in languages other than English, but non-BMP characters are quite rare.
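To make that concrete, here is a small Python 3 sketch (the Deseret letter U+10400 is just a convenient non-BMP example): written as a single code point it lowercases correctly, while the same character spelled as a surrogate pair is left untouched.

>>> '\U00010400'.lower() == '\U00010428'      # DESERET CAPITAL LONG I -> small letter
True
>>> '\uD801\uDC00'.lower() == '\uD801\uDC00'  # same character as a surrogate pair: unchanged
True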
Can't you just use the same special handling you already need for combining characters to also handle surrogate pairs in UTF-16?
I don't know what you're talking about. Combining sequences, whose constituent characters each have an individual identity, are nothing like surrogate characters, which are only meaningful in pairs.
In particular, characters in a combining sequence can be converted to another encoding form one character at a time.
>>> 'a'.encode('UTF-8') + '\u0301'.encode('UTF-8')
b'a\xcc\x81'
But not surrogates:
>>> '\uD801'.encode('UTF-8') + '\uDC00'.encode('UTF-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 0: surrogates not allowed
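A short sketch of what has to happen instead (again just the standard surrogate-pair arithmetic, not a library call): the two code units must first be recombined into a single code point, which can then be encoded on its own.

>>> cp = 0x10000 + ((0xD801 - 0xD800) << 10) + (0xDC00 - 0xDC00)
>>> hex(cp)
'0x10400'
>>> chr(cp).encode('UTF-8')
b'\xf0\x90\x90\x80'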