Do the Unicode encodings UTF-8, UTF-16, and UTF-32 differ in the number of characters they can store?

OK. I know this looks like the typical "Why didn't he just Google it or go to www.unicode.org and look it up?" question, but for such a simple question the answer still eludes me after checking both sources.

I am fairly sure that all three of these encoding systems support all of the Unicode characters, but I need to verify that before I make that claim in a presentation.

Bonus question: Do these encodings differ in the number of characters they can be extended to support?

+39
unicode
Sep 24 '08 at 22:51
6 answers

No, they are simply different encoding methods. They all support encoding the same set of characters.

UTF-8 uses one to four bytes per character, depending on which character you encode. Characters in the ASCII range take only one byte, and very unusual characters take four.

UTF-32 uses four bytes per character regardless of which character it is, so it will use more space than UTF-8 to encode the same string. Its only advantage is that you can calculate the number of characters in a UTF-32 string by counting bytes alone.

UTF-16 uses two bytes for most characters, four bytes for unusual characters.
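As a quick sanity check, here is a minimal Python sketch (my own illustration, standard library only, with arbitrary sample characters) showing these per-character sizes directly:

    # Byte length of the same characters in the three encodings.
    # The "-le" codecs are used so that no byte-order mark is counted.
    for ch in ["A", "é", "€", "𝄞"]:
        print(ch,
              len(ch.encode("utf-8")),      # 1, 2, 3, 4 bytes
              len(ch.encode("utf-16-le")),  # 2, 2, 2, 4 bytes
              len(ch.encode("utf-32-le")))  # always 4 bytes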

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

+34
Sep 24 '08 at 23:04

There is no Unicode character that can be stored in one encoding but not another. This is simply because the valid Unicode characters have been restricted to what can be stored in UTF-16 (which has the smallest capacity of the three encodings). In other words, UTF-8 and UTF-32 could be used to represent a wider range of characters than UTF-16, but they are not. Read on for details.

UTF-8

UTF-8 is a variable-length code. Some characters require 1 byte, some require 2, some require 3, and some require 4. The bytes for each character are simply written one after another as a continuous stream of bytes.

Although some UTF-8 characters may be 4 bytes long, UTF-8 cannot encode 2^32 characters. It's not even close. I will try to explain the reasons for this.

Software that reads a UTF-8 stream just receives a sequence of bytes -- how is it supposed to decide whether the next 4 bytes are a single 4-byte character, or two 2-byte characters, or four 1-byte characters (or some other combination)? Basically this is done by deciding that certain 1-byte sequences are not valid characters, certain 2-byte sequences are not valid characters, and so on. When these invalid sequences appear, it is assumed that they are part of a longer sequence.

You have seen a quite different example of this, I am sure: it is called escaping. In many programming languages it is decided that the \ character in a string's source code does not translate into any valid character in the string's "compiled" form. When a \ appears in the source, it is assumed to be part of a longer sequence, such as \n or \xFF . Note that \x is an invalid 2-character sequence, and \xF is an invalid 3-character sequence, but \xFF is a valid 4-character sequence.

Basically, there is a trade-off between having many characters and having shorter characters. If you want 2^32 characters, they need to average 4 bytes long. If you want all your characters to be 2 bytes or less, then you cannot have more than 2^16 characters. UTF-8 gives a reasonable compromise: all ASCII characters (ASCII 0 to 127) get 1-byte representations, which is great for compatibility, but many more characters are still allowed.

Like most variable-length encoding schemes, including the kinds of escape sequences shown above, UTF-8 is a prefix code. That means the decoder just reads byte after byte, and as soon as it reaches the last byte of a character, it knows what the character is (and it knows that it is not the beginning of a longer character).

For example, the character "A" is represented using the byte 65, and there are no two-, three-, or four-byte characters whose first byte is 65. Otherwise the decoder could not tell those characters apart from an "A" followed by something else.

But UTF-8 is restricted even further: it ensures that the encoding of a shorter character never appears anywhere within the encoding of a longer character. For example, none of the bytes in a 4-byte character can be 65.

Since UTF-8 has 128 different 1-byte characters (whose byte values are 0-127), all 2-, 3-, and 4-byte characters are composed solely of bytes in the range 128-255. That is a big restriction. However, it allows byte-oriented string functions to work with little or no modification. For example, the C function strstr() always works as expected if its inputs are valid UTF-8 strings.
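Here is a minimal Python sketch (my own illustration, not part of the original answer) of how the lead byte alone determines the sequence length, and of the property that ASCII byte values never occur inside a longer sequence:

    def utf8_char_length(lead: int) -> int:
        """Length of a UTF-8 sequence, determined from its lead byte alone."""
        if lead < 0x80:              # 0xxxxxxx: ASCII, 1 byte
            return 1
        if 0xC0 <= lead < 0xE0:      # 110xxxxx: lead byte of a 2-byte sequence
            return 2
        if 0xE0 <= lead < 0xF0:      # 1110xxxx: lead byte of a 3-byte sequence
            return 3
        if 0xF0 <= lead < 0xF8:      # 11110xxx: lead byte of a 4-byte sequence
            return 4
        raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")

    data = "A€𝄞".encode("utf-8")
    print(list(data))                 # only the 'A' byte (65) is below 128
    print(utf8_char_length(data[0]))  # 1 (the 'A')
    print(utf8_char_length(data[1]))  # 3 (the '€' starts here)
    assert b"A" not in "€𝄞".encode("utf-8")  # 65 never appears inside longer sequences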

UTF-16

UTF-16 is also a variable-length code; its characters consume either 2 or 4 bytes. The 16-bit values in the range 0xD800-0xDFFF are reserved for constructing 4-byte characters, and all 4-byte characters consist of a 16-bit unit in the range 0xD800-0xDBFF (a high surrogate) followed by a 16-bit unit in the range 0xDC00-0xDFFF (a low surrogate). For this reason, Unicode does not assign any characters in the range U+D800-U+DFFF.
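A small Python sketch (my own, using an arbitrary example code point) of how a character above U+FFFF is split into such a surrogate pair:

    import struct

    cp = 0x1D11E                  # U+1D11E, MUSICAL SYMBOL G CLEF, above U+FFFF
    v = cp - 0x10000              # 20-bit value spread across the two surrogates
    high = 0xD800 + (v >> 10)     # high surrogate, in 0xD800-0xDBFF
    low = 0xDC00 + (v & 0x3FF)    # low surrogate, in 0xDC00-0xDFFF
    print(hex(high), hex(low))    # 0xd834 0xdd1e

    # The built-in codec produces exactly these two 16-bit units.
    units = struct.unpack("<2H", chr(cp).encode("utf-16-le"))
    assert units == (high, low)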

UTF-32

UTF-32 is a fixed-length code: every character is 4 bytes long. Although this would allow 2^32 different characters to be encoded, only values from 0 to 0x10FFFF are allowed in this scheme.
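A quick Python sketch (my own; the string is an arbitrary example) of the count-by-bytes property mentioned in the first answer:

    s = "Añ€𝄞"                        # 4 code points of very different UTF-8 sizes
    encoded = s.encode("utf-32-le")   # little-endian, no byte-order mark
    print(len(encoded) // 4)          # 4: the character count is just bytes / 4
    assert all(ord(ch) <= 0x10FFFF for ch in s)  # valid code points never exceed 0x10FFFF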

Capacity comparison (the arithmetic behind these numbers is sketched after the list):

  • UTF-8: 2,097,152 (actually 2,166,912, but because of design details some of them map to the same thing)
  • UTF-16: 1,112,064
  • UTF-32: 4,294,967,296 (but limited to the first 1,114,112)
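Roughly where those figures come from (my own arithmetic sketch in Python; the UTF-8 line counts distinct 21-bit patterns rather than the larger raw-sequence count quoted above):

    # UTF-8 limited to 4 bytes: the longest form carries 3 + 6 + 6 + 6 = 21 payload bits
    utf8_patterns = 2 ** 21                             # 2,097,152
    # UTF-16: every 16-bit value except the 2,048 surrogates, plus the
    # 1,024 * 1,024 code points reachable through surrogate pairs
    utf16_code_points = (2 ** 16 - 2048) + 1024 * 1024  # 1,112,064
    # UTF-32: 2^32 raw values, but Unicode itself stops at U+10FFFF
    utf32_raw = 2 ** 32                                 # 4,294,967,296
    unicode_code_points = 0x10FFFF + 1                  # 1,114,112
    print(utf8_patterns, utf16_code_points, utf32_raw, unicode_code_points)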

So the most restrictive is UTF-16! The formal Unicode definition has limited the Unicode characters to those that can be encoded with UTF-16 (i.e., the range U+0000 to U+10FFFF, excluding U+D800 to U+DFFF). UTF-8 and UTF-32 support all of these characters.

The UTF-8 system is in fact "artificially" limited to 4 bytes. It could be extended to 8 bytes without violating the restrictions outlined earlier, and this would yield a capacity of 2^42. The original UTF-8 specification actually allowed up to 6 bytes, which gives a capacity of 2^31. But RFC 3629 restricted it to 4 bytes, since that is all that is needed to cover everything UTF-16 does.

There are other (mostly historical) Unicode encoding schemes, notably UCS-2 (which is capable of encoding U+0000 to U+FFFF).

+44
Nov 11 '08 at 6:42

UTF-8, UTF-16, and UTF-32 all support the full range of Unicode code points. There are no characters that are supported by one but not another.

As for the bonus question, "Do these encodings differ in the number of characters they can be extended to support?" -- yes and no. The way UTF-8 and UTF-16 encode characters limits the total number of code points they can support to fewer than 2^32. However, the Unicode Consortium will not add code points to UTF-32 that cannot be represented in UTF-8 or UTF-16. Doing so would violate the spirit of the encoding standards and make it impossible to guarantee a one-to-one mapping from UTF-32 to UTF-8 (or UTF-16).
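A short Python sketch (my own) of that one-to-one round trip: every Unicode scalar value (the code points minus the surrogate range) survives encoding and decoding through all three encodings unchanged:

    import itertools

    # All scalar values: U+0000-U+D7FF and U+E000-U+10FFFF (surrogates excluded).
    scalars = itertools.chain(range(0x0000, 0xD800), range(0xE000, 0x110000))
    for cp in itertools.islice(scalars, 0, None, 4096):  # sample every 4096th value
        ch = chr(cp)
        for codec in ("utf-8", "utf-16-le", "utf-32-le"):
            assert ch.encode(codec).decode(codec) == ch
    print("round trip OK")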

+6
Sep 24 '08 at 23:00

I personally always check Joel's post on Unicode, encodings, and character sets when in doubt.

+5
Sep 24 '08 at 22:55

All of the UTF-8/16/32 encodings can represent all Unicode characters. See Wikipedia's Comparison of Unicode encodings.

The IBM article Encode your XML documents in UTF-8 is very useful and suggests that, if you have the choice, it is best to pick UTF-8. The main reasons are wide tool support, and that UTF-8 can usually pass through systems that are unaware of Unicode.

From the "What the specifications say" section of the IBM article:

Both the W3C and the IETF have recently become more adamant about choosing UTF-8 first, last, and sometimes only. The W3C's Character Model for the World Wide Web 1.0: Fundamentals states, "When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. US-ASCII is upwards-compatible with UTF-8 (a US-ASCII string is also a UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired." In practice, compatibility with US-ASCII is so useful that it is almost a requirement. The W3C wisely explains, "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes."

+4
Sep 24 '08 at 23:13

As everyone has said, UTF-8, UTF-16, and UTF-32 can all encode all of the Unicode code points. However, the UCS-2 variant (sometimes erroneously referred to as UCS-16) cannot, and that is the one you will find, for example, on Windows XP/Vista.

See Wikipedia for more details.

Edit: I was wrong about Windows; NT was the only Windows to support UCS-2. However, many Windows applications assume a single word per code point, as in UCS-2, so you are likely to find bugs. See another Wikipedia article. (Thanks to JasonTrue.)

+2
Sep 25 '08 at 2:18


