What problems will arise when processing UTF-16 as a fixed 16-bit encoding?

I read a few questions on SO about Unicode, and there were some comments I didn't quite understand, for example:

Dean Harding: UTF-8 is a variable-length encoding, which is more complex to process than a fixed-length encoding. Also, see my comment on Gumbo's answer: basically, combining characters exist in all encodings (UTF-8, UTF-16 and UTF-32) and they require special handling. You can use the same special handling that you use for combining characters to also handle surrogate pairs in UTF-16, so for the most part you can ignore surrogates and treat UTF-16 as a fixed encoding.

I was a little confused by the last part ("for the most part"). If UTF-16 is treated as a fixed 16-bit encoding, what problems can arise? How likely is it to encounter characters outside the BMP? And if you do encounter them, what problems arise from assuming two-byte characters?

I read the Wikipedia article about surrogates, but it didn't make things any clearer to me!

Edit: I suppose what I really mean is: "Why would anyone suggest treating UTF-16 as a fixed encoding when that seems bogus?"

Edit2:

I found another comment on the question Is there any reason to prefer UTF-16 over UTF-8? which, in my opinion, explains this a bit better:

Andrew Russell: For performance: UTF-8 is much harder to decode than UTF-16. In UTF-16, a character is either a Basic Multilingual Plane character (2 bytes) or a surrogate pair (4 bytes). UTF-8 characters can be anywhere between 1 and 4 bytes.

This suggests that the point is that UTF-16 never contains three-byte characters, so by assuming 16-bit units you won't end up "completely corrupted", off by a single byte. But I'm still not convinced this is any different from assuming UTF-8 is single-byte characters!

+4

4 answers

UTF-16 covers all characters of the Basic Multilingual Plane (BMP). The BMP covers most modern writing systems and includes many older characters you might conceivably come across. Take a look at the supplementary planes and decide whether you will really ever encounter any characters from them: cuneiform, alchemical symbols, etc. Few people will really miss them.

If you still run into characters that require the supplementary planes, they are encoded with two code units (surrogates), and you will see two empty squares or question marks instead of such a character. UTF-16 is self-synchronizing, so one half of a surrogate pair never looks like a legitimate character on its own. This lets you work with strings in the usual way even when surrogates are present and you don't process them specially.

Thus, the problems you run into when processing UTF-16 as if it were UCS-2 are minimal, apart from the fact that you don't handle supplementary-plane characters.
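A minimal Python sketch (not part of the original answer) of the point above: a supplementary-plane character is a single code point, but becomes two 16-bit code units (a surrogate pair) in UTF-16.

```python
# U+1F600 is outside the BMP, so UTF-16 needs a surrogate pair for it.
ch = "\U0001F600"
print(len(ch))                       # 1 code point
print(len(ch.encode("utf-16-be")))   # 4 bytes = two 16-bit code units
```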

EDIT: Unicode uses "combining marks" that render in the space of the preceding character: accents, tildes, strokes, etc. Sometimes the combination of a diacritical mark and a letter can also be represented as a single code point, e.g. á can be represented as the single code point \u00e1 instead of a plain "a" plus a combining accent, i.e. \u0061\u0301. However, you cannot represent unusual combinations, such as a z with a diacritic, as a single code point. This makes search and splitting algorithms more complex. If you somehow make your string data homogeneous (for example, using only plain letters plus combining marks), search and splitting become simple again, but either way you lose the "one position = one character" property. A symmetric problem arises if you are serious about typesetting and want to store ligatures such as fi or ffl explicitly, where one code point corresponds to 2 or 3 characters. This is not a UTF problem, it is a Unicode problem in general, AFAICT.
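A short sketch of the combining-mark point (my illustration, not from the answer): the same user-perceived character can be one code point (precomposed) or two (base letter plus combining mark), and Unicode normalization maps between the two forms.

```python
import unicodedata

precomposed = "\u00e1"   # 'á' as a single code point
combining = "a\u0301"    # 'a' followed by COMBINING ACUTE ACCENT

print(precomposed == combining)                                # False
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```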

+3

It is important to understand that even UTF-32 is fixed-length only in terms of code points, not characters. Many characters consist of several code points, so you cannot have a Unicode encoding where one number (code unit) corresponds to one character (as perceived by users).

To answer your question: the most obvious problem with processing UTF-16 as a fixed-length encoding is splitting a string in the middle of a surrogate pair, which leaves you with two invalid code units. It all really depends on what you do with the text.
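A minimal sketch of that failure mode, under the assumption that the naive code slices at 16-bit-unit boundaries: cutting UTF-16 bytes in the middle of a surrogate pair leaves a lone surrogate that cannot be decoded.

```python
# One supplementary-plane character -> a 4-byte surrogate pair in UTF-16.
data = "\U0001F600".encode("utf-16-le")
print(len(data))  # 4

try:
    # Naively keep only the first 16-bit unit: a lone high surrogate.
    data[:2].decode("utf-16-le")
except UnicodeDecodeError as err:
    print("cannot decode a lone surrogate:", err.reason)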

+3

I guess I really mean, "Why would anyone suggest treating UTF-16 as a fixed encoding when that seems bogus?"

Two words: backward compatibility.

Unicode was originally intended to be a fixed-width 16-bit encoding (UCS-2), so early Unicode adopters (like Sun with Java and Microsoft with Windows NT) used a 16-bit character type. When it turned out that 65,536 characters were not enough for everything, UTF-16 was designed to let these 16-bit character systems represent the 16 new "planes".

This meant that characters were no longer fixed-width, so people came up with the rationalization that "this is OK because UTF-16 is almost fixed-width".

But I'm still not convinced this is any different from assuming UTF-8 is single-byte characters!

Strictly speaking, it is no different. You will get incorrect results for things like "\uD801\uDC00".lower() .
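A sketch of what "incorrect results" means here (my illustration; U+10400 DESERET CAPITAL LETTER LONG I lowercases to U+10428): case-mapping the real code point works, while the same character viewed as two separate surrogate units has no case mapping and is left unchanged.

```python
# Proper code-point-level lowercasing of a supplementary-plane letter.
proper = "\U00010400".lower()
print(proper == "\U00010428")     # True

# The same character as two lone surrogate "characters": no case mapping,
# so naive 16-bit processing returns it unchanged, i.e. the wrong answer.
naive = "\uD801\uDC00".lower()
print(naive == "\uD801\uDC00")    # True
```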

However, assuming UTF-16 is fixed-width is less likely to break than assuming UTF-8 is fixed-width. Non-ASCII characters are very common in languages other than English, but non-BMP characters are very rare.

You can use the same special handling that you use for combining characters to also handle surrogate pairs in UTF-16

I don't know what he is talking about. Combining sequences, whose constituent characters each have an individual identity, are nothing like surrogate units, which are meaningful only in pairs.

In particular, the characters in a combining sequence can be converted to another encoding form one character at a time:

 >>> 'a'.encode('UTF-8') + '\u0301'.encode('UTF-8')
 b'a\xcc\x81'

But surrogates cannot:

 >>> '\uD801'.encode('UTF-8') + '\uDC00'.encode('UTF-8')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 0: surrogates not allowed
+2

UTF-16 is a variable-length encoding. The older UCS-2 is not. If you treat a variable-length encoding as fixed-length (constant width), you risk introducing a bug whenever you use the "number of 16-bit code units" to mean the "number of characters", since the number of characters can actually be smaller than the number of 16-bit units.
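A minimal sketch (my illustration) of the unit-count vs. character-count mismatch this answer warns about:

```python
# 'a' plus a supplementary-plane character.
s = "a\U0001F600"
units = len(s.encode("utf-16-le")) // 2   # number of 16-bit code units

print(units)    # 3 code units
print(len(s))   # 2 code points
```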

0

Source: https://habr.com/ru/post/1340621/

