What is the point of UTF-16?

I have never understood the point of the UTF-16 encoding. If you need to be able to treat strings as random access (that is, one code point is one code unit), then you need UTF-32, since UTF-16 is still variable length. If you don't need that, then UTF-16 looks like a huge waste of space compared to UTF-8. What are the advantages of UTF-16 over UTF-8 and UTF-32, and why do Windows and Java use it as their native encoding?

+44
utf-8 character-encoding utf-16 utf-32 utf
Mar 13 '11
5 answers

When Windows NT was developed, UTF-16 did not exist (NT 3.51 was born in 1993, while UTF-16 was born in 1996 with the Unicode 2.0 standard); instead there was UCS-2, which at that time was enough to hold every character available in Unicode, so the "1 code unit = 1 code point" equivalence really did hold - no variable-length logic was needed for strings.

Then they moved to UTF-16 to support the full Unicode character set; however, they could not move to UTF-8 or UTF-32, because that would have broken binary compatibility of the API (among other things).

As for Java, I'm not sure; since it was released in 1995, I suspect UTF-16 was already in the air (even if it had not yet been standardized), but I think compatibility with NT-based operating systems may also have played a role in their choice (constant UTF-8 ↔ UTF-16 conversions for every call into the Windows API would cause some slowdown).




Edit

Wikipedia explains that for Java it happened in exactly the same way: it originally supported UCS-2, but switched to UTF-16 in J2SE 5.0.

So, in general, when you see UTF-16 used in some API or framework, it is because it started out as UCS-2 (to avoid complications in the string-handling algorithms) and then moved to UTF-16 to support code points outside the BMP while keeping the same code unit size.

+35
Mar 13 '11

None of the answers claiming an advantage of UTF-16 over UTF-8 makes any sense, except for the backward-compatibility answer.

Well, there are two caveats to my comment.

Erik states: "UTF-16 covers the entire BMP with single code units. So unless you need the rarer characters outside the BMP, UTF-16 is effectively 2 bytes per character."

Caveat 1)

If you can be certain that your application will NEVER need any character outside the BMP, and that any library code you write for use with it will NEVER be used with any application that will ever need a character outside the BMP, then you could use UTF-16 and write code that makes the implicit assumption that every character is exactly two bytes long.

It seems extremely dangerous (actually, stupid).

It would take only one single character outside the BMP, one that the application or library code may someday need, to break code written on the assumption that every UTF-16 character is two bytes long.

Therefore, code that examines or processes UTF-16 must be written to handle the case of a UTF-16 character requiring more than 2 bytes.

Therefore, I “dismiss” this caveat.

Therefore, UTF-16 is no easier to code for than UTF-8 (code for both must handle variable-length characters).
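
To make that concrete, here is a minimal Java sketch (mine, not part of the answer above) of what handling variable-length characters looks like: walk the string by code point rather than by char, so that surrogate pairs are never split.

    public class CodePointWalk {
        public static void main(String[] args) {
            // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
            // so in UTF-16 it occupies two char code units (a surrogate pair).
            String s = "G clef: \uD834\uDD1E";

            // Wrong: assumes 1 char == 1 character, so it reports the two
            // surrogate halves as if they were separate characters.
            for (int i = 0; i < s.length(); i++) {
                System.out.printf("code unit %d: U+%04X%n", i, (int) s.charAt(i));
            }

            // Right: advance by Character.charCount(cp), which is 1 or 2 units.
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                System.out.printf("code point: U+%04X%n", cp);
                i += Character.charCount(cp);
            }
        }
    }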

Caveat 2)

UTF-16 MAY be more computationally efficient, in some circumstances, if the code is written appropriately.

For example: suppose that certain long strings are rarely changed but often examined (or, better, never changed once built, i.e. a string builder creating immutable strings). A flag could be set on each string, indicating whether it contains only "fixed-length" characters (i.e. contains no character requiring more than two bytes). Strings for which the flag is true could then be examined with optimized code that assumes fixed-length (2-byte) characters.
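
A minimal Java sketch of that idea (the class and method names here are hypothetical, not from the answer): the flag is computed once when the string is wrapped, and indexing takes the fast path only when no surrogate code units are present.

    // Hypothetical wrapper illustrating the "fixed-length flag" idea.
    public final class CheckedString {
        private final String value;
        private final boolean bmpOnly;  // true => every code point fits in one char

        public CheckedString(String value) {
            this.value = value;
            // Computed once at construction; surrogate code units appear
            // only when a code point outside the BMP is present.
            boolean flag = true;
            for (int i = 0; i < value.length(); i++) {
                if (Character.isSurrogate(value.charAt(i))) {
                    flag = false;
                    break;
                }
            }
            this.bmpOnly = flag;
        }

        /** Returns the n-th code point: O(1) when the flag is set, O(n) otherwise. */
        public int codePointAtIndex(int n) {
            if (bmpOnly) {
                return value.charAt(n);                 // fast path: fixed 2-byte units
            }
            int i = value.offsetByCodePoints(0, n);     // slow path: walk surrogate pairs
            return value.codePointAt(i);
        }
    }

Whether the extra bookkeeping pays off depends, as the premise says, on strings being examined far more often than they are built.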

How about space efficiency?

UTF-16 is obviously more efficient for (A): characters that UTF-16 encodes in fewer bytes than UTF-8 does.

UTF-8 is obviously more efficient for (B): characters that UTF-8 encodes in fewer bytes than UTF-16 does.

Except for very "specialized" text, count (B) is likely to be far larger than count (A).
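
As a rough illustration (a small Java sketch with arbitrarily chosen sample strings, not from the answer): mostly-ASCII text falls into group (B), while CJK text falls into group (A).

    import java.nio.charset.StandardCharsets;

    public class EncodingSize {
        public static void main(String[] args) {
            // Group (B): ASCII-heavy text - 1 byte per character in UTF-8, 2 in UTF-16.
            String english = "The quick brown fox jumps over the lazy dog";
            // Group (A): CJK text - 3 bytes per character in UTF-8, 2 in UTF-16.
            String japanese = "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8";

            print("English ", english);
            print("Japanese", japanese);
        }

        static void print(String label, String s) {
            System.out.printf("%s  UTF-8: %3d bytes   UTF-16: %3d bytes%n",
                    label,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16LE).length);  // LE avoids the BOM
        }
    }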

+12
Jan 05 '14 at 9:11

UTF-16 covers the entire BMP with single code units. So unless you need the rarer characters outside the BMP, UTF-16 is effectively 2 bytes per character. UTF-32 takes more space, and UTF-8 requires variable-length handling.

+4
Mar 13 '11 at 20:32

UTF-16 is generally used as a direct mapping to multi-byte character sets, i.e. only the originally assigned 0-0xFFFF characters.

This gives you the best of both worlds: you have a fixed character size, but you can still print all the characters anyone is likely to use (Klingon religious scripts excepted).

+1
Mar 13 '11 at 20:32

UTF-16 allows the whole Basic Multilingual Plane (BMP) to be represented as single code units. Unicode code points beyond U+FFFF are represented by surrogate pairs.

Interestingly, Java and Windows (and other systems that use UTF-16) all operate at the code unit level, not at the Unicode code point level. So a string consisting of the single character U+1D122 (MUSICAL SYMBOL F CLEF) gets encoded in Java as "\ud834\udd22", and "\ud834\udd22".length() == 2 (not 1). So this is something of a hack, but it turns out that characters are not variable length.
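
A short Java check of that behaviour (my sketch, not part of the original answer):

    public class CodeUnitVsCodePoint {
        public static void main(String[] args) {
            // U+1D122 MUSICAL SYMBOL F CLEF, written as its surrogate pair.
            String clef = "\uD834\uDD22";

            System.out.println(clef.length());                          // 2 (char code units)
            System.out.println(clef.codePointCount(0, clef.length()));  // 1 (Unicode code point)
            System.out.println(clef.charAt(0) == '\uD834');             // true: a lone high surrogate
        }
    }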

The advantage of UTF-16 over UTF-8 is that you would give up too much if the same hack were applied to UTF-8: in UTF-8 a single code unit covers only ASCII, whereas in UTF-16 it covers the whole BMP.

+1
Mar 13 '11


