How many characters can UTF-8 encode?

If UTF-8 is 8 bits, does this mean that there can only be 256 different characters?

The first 128 code points are the same as in ASCII. But it is said that UTF-8 can support up to a million characters?

How does this work?

+41
utf-8 character-encoding ascii
Apr 19 2018-12-12T00:
9 answers

UTF-8 does not always use one byte per character; it uses from 1 to 4 bytes.

The first 128 characters (US-ASCII) need one byte.

The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin-script alphabets, as well as the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets, and Combining Diacritical Marks.

Three bytes are needed for the rest of the Basic Multilingual Plane, which contains virtually all characters in common use [12], including most Chinese, Japanese and Korean (CJK) characters.

Four bytes are needed for characters in the other Unicode planes, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

source: Wikipedia
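To check these byte counts yourself, here is a small Python sketch. The sample characters and the helper name `utf8_length` are my own picks for illustration and are not part of the quoted text; it relies only on Python's built-in `str.encode`.

```python
# Sample characters chosen for illustration (assumption: not from the quoted text).
samples = {
    "A": "U+0041, ASCII",
    "é": "U+00E9, two-byte range",
    "€": "U+20AC, rest of the Basic Multilingual Plane",
    "😀": "U+1F600, outside the Basic Multilingual Plane",
}


def utf8_length(ch: str) -> int:
    """Number of bytes Python's built-in UTF-8 encoder produces for one character."""
    return len(ch.encode("utf-8"))


for ch, note in samples.items():
    print(f"{ch!r} ({note}): {utf8_length(ch)} byte(s)")

# Count the code points U+0080..U+07FF, all of which take exactly two bytes.
two_byte_count = sum(1 for cp in range(0x80, 0x800) if utf8_length(chr(cp)) == 2)
print(two_byte_count)  # 1920
```

The last line confirms where the figure of 1,920 comes from: 2^11 values minus the 128 that already fit in one byte.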

+70
Apr 19 2018-12-12T00:

UTF-8 uses 1 to 4 bytes per character: one byte for ASCII characters (the first 128 Unicode values are the same as ASCII), which only requires 7 bits. If the most significant ("sign") bit is set, this indicates the start of a multibyte sequence; the number of consecutive high bits set gives the number of bytes, followed by a 0, and the remaining bits of that byte contribute to the value. In each of the following bytes, the two highest bits are 1 and 0, and the remaining 6 bits are for the value.

So a four-byte sequence starts with 11110... (where ... = three bits for the value), followed by three bytes with 6 value bits each, which gives a 21-bit value. 2^21 exceeds the number of characters in Unicode, so all of Unicode can be expressed in UTF-8.
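The bit layout described above can be sketched as a hand-written encoder. This is only an illustration under my own naming (`encode_utf8` is hypothetical), not how any particular library implements it; the final loop compares the result with Python's built-in encoder.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by hand, following the lead/continuation bit layout."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # 0xxxxxxx: 7 value bits
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: 5 + 6 = 11 value bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 = 16 value bits
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 = 21 value bits
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare against the built-in encoder for a few sample characters.
for ch in "Aé€😀":
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")
    print(ch, encode_utf8(ord(ch)).hex(" "))
```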

+30
Apr 19 '12 at 13:40

2017-07-11: Corrected for double-counting the same code point encoded with multiple byte lengths

According to this table*, UTF-8 should support:

2^7 + 2^11 + 2^16 + 2^21 + 2^26 + 2^31 = 2,216,757,376 characters (original figure, which double-counts code points)

2^31 = 2,147,483,648 characters

However, RFC 3629 restricted the possible values, so now we are limited to 4 bytes, which gives us

2^7 + 2^11 + 2^16 + 2^21 = 2,164,864 characters (original figure, which double-counts code points)

2^21 = 2,097,152 characters

Note that a good chunk of those characters are "reserved" for custom use, which is actually pretty handy for icon fonts.

* Wikipedia used to show a table with 6 bytes; they have since updated the article.
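For anyone who wants to reproduce the arithmetic, here is a short Python sketch. The dictionary name `payload_bits` is my own; the bit widths are the ones used in the sums above. Summing over all sequence lengths counts small values several times, because a value that fits in one byte also fits in the longer layouts, so the number of distinct values is just the largest power of two.

```python
# Value bits available in a sequence of n bytes under the original 6-byte design.
payload_bits = {1: 7, 2: 11, 3: 16, 4: 21, 5: 26, 6: 31}

# Sums over all sequence lengths (these double-count small values).
sum_six = sum(2 ** bits for bits in payload_bits.values())
sum_four = sum(2 ** payload_bits[n] for n in range(1, 5))

# Distinct values: only the widest layout matters.
distinct_six = 2 ** payload_bits[6]
distinct_four = 2 ** payload_bits[4]

print(f"{sum_six:,}")        # 2,216,757,376
print(f"{distinct_six:,}")   # 2,147,483,648
print(f"{sum_four:,}")       # 2,164,864
print(f"{distinct_four:,}")  # 2,097,152
```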

+10
Jul 20 '16 at 18:38

UTF-8 is a variable-length encoding with a minimum of 8 bits per character.
Characters with higher code points take up to 32 bits.

+3
Apr 19 '12 at 13:35

Quote from Wikipedia: "UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed "octets" in the Unicode Standard)."

Some links:

+3
Apr 19 '12 at 13:35

2,164,864 "characters" can be potentially encoded by UTF-8.

This number is 2 ^ 7 + 2 ^ 11 + 2 ^ 16 + 2 ^ 21, which comes from how the encoding works:

  • 1-byte characters have 7 bits for encoding 0xxxxxxx (0x00-0x7F)

  • 2-byte characters have 11 bits for encoding 110xxxxx 10xxxxxx (0xC0-0xDF for the first byte, 0x80-0xBF for the second)

  • 3-byte characters have 16 bits for encoding 1110xxxx 10xxxxxx 10xxxxxx (0xE0-0xEF for the first byte, 0x80-0xBF for continuation bytes)

  • 4-byte characters have 21 bits for encoding 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (0xF0-0xF7 for the first byte, 0x80-0xBF for continuation bytes)

As you can see, this is significantly more than current Unicode allows (1,112,064 characters).
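The first-byte ranges in the list above are enough to tell how long a sequence is. A minimal Python sketch follows; `sequence_length` is a name I made up, and it uses the ranges exactly as listed in this answer. Note that RFC 3629 is stricter than these ranges: 0xC0, 0xC1 and 0xF5-0xF7 never occur as first bytes in valid UTF-8.

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length implied by a first byte, using the ranges listed above."""
    if 0x00 <= lead <= 0x7F:
        return 1  # 0xxxxxxx
    if 0xC0 <= lead <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF7:
        return 4  # 11110xxx
    # 0x80-0xBF are continuation bytes; 0xF8-0xFF are not used by the 4-byte scheme.
    raise ValueError(f"0x{lead:02X} cannot start a sequence")


# Check against what the built-in encoder actually produces.
for ch in "Aé€😀":
    encoded = ch.encode("utf-8")
    assert sequence_length(encoded[0]) == len(encoded)
    print(f"{ch!r}: first byte 0x{encoded[0]:02X}, {len(encoded)} byte(s)")
```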

+3
Oct 29 '17 at 12:08 on

Check out the Unicode Standard and related information, such as its FAQ entry on UTF-8, UTF-16, UTF-32 and BOM. It is not exactly smooth sailing, but it is authoritative information, and much of what you might read about UTF-8 elsewhere is questionable.

The "8" in "UTF-8" refers to the length of the code units in bits. Code units are the entities used to encode characters, not necessarily as a simple one-to-one mapping. UTF-8 uses a variable number of code units to encode a character.

The collection of characters that can be encoded in UTF-8 is exactly the same as for UTF-16 or UTF-32, namely all Unicode characters. They all encode the entire Unicode coding space, which even includes noncharacters and unassigned code points.
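To make the code unit idea concrete, here is a small Python sketch that counts code units for the same character in the three encoding forms. The helper name `code_units` and the sample characters are my own; the little-endian codec names are used only so that no byte order mark is added to the output.

```python
def code_units(ch: str) -> dict:
    """Count the code units needed for one character in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


for ch in "A€😀":
    print(f"U+{ord(ch):04X}", code_units(ch))
```

All three forms can represent each of these characters; only the number and size of the code units differ.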

+2
Apr 19 2018-12-12T00:

Unicode vs UTF-8

Unicode resolves code points to characters. UTF-8 is a storage mechanism for Unicode. Unicode has a spec. UTF-8 has a spec. They have different limits: UTF-8 has a different upper bound.

Unicode

Unicode is divided into "planes". Each plane holds 2^16 code points, and Unicode has 17 planes, for a total of 17 * 2^16 code points. The first plane, plane 0 or the BMP, is special in the weight of what it carries.

Rather than explain all the nuances, let me just quote the above article on planes.

The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment.
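The figures in that quote can be re-derived with a few lines of Python. The variable names are mine; the ranges are the surrogate block, the non-character positions, and the three private use areas.

```python
PLANES = 17
POINTS_PER_PLANE = 2 ** 16

total = PLANES * POINTS_PER_PLANE

# Surrogates: U+D800..U+DFFF.
surrogates = 0xDFFF - 0xD800 + 1

# Non-characters: U+FDD0..U+FDEF plus the last two code points of every plane.
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * PLANES

# Private use: U+E000..U+F8FF, plus planes 15 and 16 minus their two non-characters.
private_use = (0xF8FF - 0xE000 + 1) + 2 * (POINTS_PER_PLANE - 2)

public = total - surrogates - noncharacters - private_use

print(f"{total:,}")               # 1,114,112
print(f"{surrogates:,}")          # 2,048
print(noncharacters)              # 66
print(f"{private_use:,}")         # 137,468
print(f"{public:,}")              # 974,530
print(f"{total - surrogates:,}")  # 1,112,064
```

The last line is the 1,112,064 figure that appears in other answers: every code point except the surrogates.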

UTF-8

Now let's go back to the article linked above:

The encoding scheme used by UTF-8 was designed with a much larger limit of 2^31 code points (32,768 planes), and can encode 2^21 code points (32 planes) even if limited to 4 bytes. [3] Since Unicode limits the code points to the 17 planes that can be encoded by UTF-16, code points above 0x10FFFF are invalid in UTF-8 and UTF-32.

So you can see that you can put things into the UTF-8 byte layout that are not valid Unicode. Why? Because the UTF-8 encoding scheme can represent code points that Unicode does not even support.

UTF-8, even with a four-byte limit, supports 2^21 code points, which is far more than 17 * 2^16.
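Here is a small Python sketch of that gap. The helper `raw_four_byte_pattern` is hypothetical and deliberately performs no validity checks; it just packs a 21-bit value into the four-byte layout. Python's built-in decoder, which follows the 0x10FFFF limit, accepts the highest Unicode code point and rejects the next value up.

```python
def raw_four_byte_pattern(value: int) -> bytes:
    """Pack a 21-bit value into 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, with no validity checks."""
    if not 0 <= value < 2 ** 21:
        raise ValueError("value does not fit in 21 bits")
    return bytes([
        0xF0 | (value >> 18),
        0x80 | ((value >> 12) & 0x3F),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    ])


def strictly_decodable(data: bytes) -> bool:
    """True if Python's built-in UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(strictly_decodable(raw_four_byte_pattern(0x10FFFF)))  # True
print(strictly_decodable(raw_four_byte_pattern(0x110000)))  # False
```

So the layout has room for the value, but a conforming decoder treats it as an error.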

+2
Jul 11 '17 at 18:58

While I agree with mpen on the current maximum number of UTF-8 codes (2,164,864) (given in mpen's answer; I could not comment on it), that figure is off by 2 levels if you remove the 2 major restrictions of UTF-8: the 4-byte limit, and the rule that codes 254 and 255 cannot be used (mpen only removed the 4-byte limit).

Start code 254 follows the basic arrangement of start bits (multi-byte flag set to 1, a count of six 1s, and a terminal 0, with no spare bits), which gives you 6 additional bytes to work with (six 10xxxxxx groups, an additional 2^36 codes).

Start code 255 does not exactly follow the basic arrangement: there is no terminal 0, because all bits are used. This gives you 7 additional bytes (multi-byte flag set to 1, a count of seven 1s, no terminal 0; seven 10xxxxxx groups, an additional 2^42 codes).

Adding these in gives a final maximum representable character set of 4,468,982,745,216. This is more than all characters in current use, old or dead languages, and any believed lost languages. Angelic or Celestial script, anyone?

There are also single-byte codes that are overlooked or ignored in the UTF-8 standard in addition to 254 and 255: 128-191, and a few others. Some of them are used locally by the keyboard; for example, code 128 is usually a deleting backspace. The other start codes (and associated ranges) are invalid for one or more reasons ( https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences ).
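The total can be checked with a short Python sketch. Keep in mind this is the hypothetical extension described in this answer, not anything RFC 3629 permits; the list name `payload_bits` is my own, and the bit widths are those of the 1- to 6-byte layouts plus the 36 and 42 bits proposed for start codes 254 and 255.

```python
# Value bits per layout: 1 to 6 bytes as in the original design, then the
# hypothetical 7-byte (start code 254) and 8-byte (start code 255) layouts.
payload_bits = [7, 11, 16, 21, 26, 31, 36, 42]

# Summed over every layout, as this answer does.
total = sum(2 ** bits for bits in payload_bits)

print(f"{2 ** 36:,}")  # 68,719,476,736
print(f"{2 ** 42:,}")  # 4,398,046,511,104
print(f"{total:,}")    # 4,468,982,745,216
```

Like the 2,164,864 figure, this sum counts a value once for every layout that can hold it, so it is an upper bound on byte sequences rather than on distinct values.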

+1
Aug 14 '16 at 21:52


