There is no Unicode character that can be stored in one encoding but not another. This is simply because the valid Unicode characters have been restricted to what can be stored in UTF-16 (which has the smallest capacity of the three encodings). In other words, UTF-8 and UTF-32 could be used to represent a wider range of characters than UTF-16, but they are not. Read on for the details.
UTF-8
UTF-8 is a variable-length code. Some characters require 1 byte, some require 2, some 3 and some 4. The bytes for each character are simply written one after another as a continuous stream of bytes.
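A quick way to see the variable lengths in action (a small sketch; Python is used here purely for illustration):

```python
# Each character below needs a different number of bytes in UTF-8.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex())
# 'A' -> 1 byte, 'é' -> 2 bytes, '€' -> 3 bytes, '😀' -> 4 bytes
```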
Although some UTF-8 characters are 4 bytes long, UTF-8 cannot encode 2^32 characters. It doesn't even come close. I will try to explain the reasons for this.
The software that reads a UTF-8 stream just receives a sequence of bytes, so how does it decide whether the next 4 bytes are a single 4-byte character, two 2-byte characters, four 1-byte characters, or some other combination? This is largely done by deciding that certain 1-byte sequences are not valid characters, certain 2-byte sequences are not valid characters, and so on. When these invalid sequences appear, it is assumed that they are part of a longer sequence.
You have seen this in a completely different context, I am sure: escaping. In many programming languages it was decided that the \ character in a string's source code does not translate to any valid character in the string's "compiled" form. When a \ is found in the source, it is assumed to be part of a longer sequence, such as \n or \xFF . Note that \x is an invalid 2-character sequence, and \xF is an invalid 3-character sequence, but \xFF is a valid 4-character sequence.
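The same idea can be seen at the byte level. In the sketch below (again just illustrative Python), a lone continuation byte is not a valid 1-byte character, and a lead byte on its own is only the start of a longer sequence, so both fail to decode:

```python
# 0x80 is never a valid character on its own; 0xC3 is only the *start* of a
# 2-byte character; 0xC3 0xA9 together form the valid 2-byte character 'é'.
for raw in (bytes([0x80]), bytes([0xC3]), bytes([0xC3, 0xA9])):
    try:
        print(raw.hex(), "->", raw.decode("utf-8"))
    except UnicodeDecodeError as err:
        print(raw.hex(), "-> cannot decode:", err.reason)
```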
Basically, there is a trade-off between having many characters and having shorter characters. If you want 2^32 characters, they need to average 4 bytes each. If you want all of your characters to be 2 bytes or less, then you cannot have more than 2^16 characters. UTF-8 gives a reasonable compromise: all ASCII characters (ASCII 0 to 127) are given 1-byte representations, which is great for compatibility, while many more characters are still allowed.
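That compatibility is easy to check (a small sketch, same illustrative Python as above): pure ASCII text produces exactly the same bytes in UTF-8 as in ASCII, and only non-ASCII characters cost extra bytes.

```python
# ASCII-only text: UTF-8 bytes are identical to ASCII bytes.
print("hello".encode("utf-8") == "hello".encode("ascii"))          # True
# One non-ASCII character makes the UTF-8 form one byte longer here.
print(len("hello".encode("utf-8")), len("héllo".encode("utf-8")))  # 5 6
```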
Like most variable-length encodings, including the kinds of escape sequences shown above, UTF-8 is an instantaneous code. This means that the decoder just reads byte by byte, and as soon as it reaches the last byte of a character, it knows what the character is (and it knows that it is not the beginning of a longer character).
For example, the character 'A' is represented using the byte 65, and there are no two-/three-/four-byte characters whose first byte is 65. If there were, the decoder would not be able to distinguish those characters from an 'A' followed by something else.
But UTF-8 is restricted even further. It ensures that the encoding of a shorter character never appears anywhere within the encoding of a longer character. For example, none of the bytes in a 4-byte character can be 65.
Since UTF-8 has 128 different 1-byte characters (whose byte values are 0-127), all 2-, 3- and 4-byte characters must be composed solely of bytes in the range 128-255. That is a big restriction. However, it allows byte-oriented string functions to work with little or no modification. For example, the C function strstr() always works as expected if its inputs are valid UTF-8 strings.
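Here is a small sketch of that property (illustrative Python; a byte-level find() plays the role of strstr() here): every byte of a multi-byte character is ≥ 128, so searching for an ASCII byte can never produce a false match inside another character.

```python
# No byte of a multi-byte UTF-8 character falls in the ASCII range 0-127.
print(all(b >= 128 for b in "β€😀".encode("utf-8")))   # True

# So a naive byte-level substring search only ever finds the real 'A's.
data = "Aβ€😀A".encode("utf-8")
print(data.find(b"A"))   # 0 -- the first genuine 'A', never a false positive
```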
UTF-16
UTF-16 is also a variable-length code; its characters consume either 2 or 4 bytes. 2-byte values in the range 0xD800-0xDFFF are reserved for constructing 4-byte characters, and all 4-byte characters consist of two bytes in the range 0xD800-0xDBFF followed by two bytes in the range 0xDC00-0xDFFF. For this reason, Unicode does not assign any characters in the range U+D800-U+DFFF.
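A small sketch of a surrogate pair (illustrative Python; big-endian UTF-16 is used so the hex reads in order): a character above U+FFFF becomes a lead unit in 0xD800-0xDBFF followed by a trail unit in 0xDC00-0xDFFF.

```python
# U+1F600 is above U+FFFF, so UTF-16 needs a surrogate pair for it.
units = "😀".encode("utf-16-be")
print(units.hex())   # d83dde00 -> the units 0xD83D (lead) and 0xDE00 (trail)

# 'A' needs one 2-byte unit, the emoji needs two (4 bytes total).
print(len("A".encode("utf-16-be")), len("😀".encode("utf-16-be")))  # 2 4
```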
UTF-32
UTF-32 is a fixed-length code; every character is 4 bytes long. Although this would allow 2^32 different characters to be encoded, only values from 0 to 0x10FFFF are allowed in this scheme.
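A one-line check of the fixed length (same illustrative Python):

```python
# Every character costs exactly 4 bytes in UTF-32, whatever its code point.
for ch in ["A", "€", "😀"]:
    print(ch, len(ch.encode("utf-32-be")))   # 4 each time
```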
Capacity comparison:
- UTF-8: 2,097,152 (actually 2,166,912, but due to design details some of them map to the same thing)
- UTF-16: 1,112,064
- UTF-32: 4,294,967,296 (but limited to the first 1,114,112)
So the most restricted is UTF-16! The formal Unicode definition has limited Unicode characters to those that can be encoded with UTF-16 (i.e. the range U+0000 to U+10FFFF, excluding U+D800 to U+DFFF). UTF-8 and UTF-32 support all of these characters.
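As a quick sanity check on the numbers above, here is the arithmetic as a small sketch (illustrative Python, using the layouts described in the sections above):

```python
# UTF-16: the 2-byte units minus the reserved surrogates, plus every
# lead/trail surrogate combination.
bmp           = 2 ** 16            # 2-byte units 0x0000-0xFFFF
surrogates    = 0xE000 - 0xD800    # 2,048 reserved values
supplementary = 1024 * 1024        # lead surrogates x trail surrogates
print(bmp - surrogates + supplementary)   # 1,112,064

# UTF-8: a 4-byte sequence carries 3 + 6 + 6 + 6 = 21 payload bits.
print(2 ** 21)                            # 2,097,152
```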
The UTF-8 system is in fact "artificially" limited to 4 bytes. It could be extended to 8 bytes without violating the restrictions described earlier, and that would give a capacity of 2^42. The original UTF-8 specification actually allowed up to 6 bytes, which gives a capacity of 2^31. But RFC 3629 restricted it to 4 bytes, since that is all that is needed to cover everything UTF-16 does.
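A rough sketch of where 2^31 and 2^42 come from, assuming the classic UTF-8 layout (a lead byte whose leading 1-bits announce the length, followed by continuation bytes carrying 6 payload bits each; the 8-byte form is hypothetical):

```python
def payload_bits(n):
    # Lead byte contributes max(7 - n, 0) bits for an n-byte sequence,
    # each of the n - 1 continuation bytes contributes 6 bits.
    return 7 if n == 1 else max(7 - n, 0) + 6 * (n - 1)

print(payload_bits(4))  # 21 -> the current RFC 3629 limit (2^21 values)
print(payload_bits(6))  # 31 -> the original specification's limit (2^31)
print(payload_bits(8))  # 42 -> a hypothetical 8-byte extension (2^42)
```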
There are other (mostly historical) Unicode encoding schemes, notably UCS-2 (which is only capable of encoding U+0000 to U+FFFF).