While I agree with mpen on the current maximum number of UTF-8 codes (2,164,864) (his answer is listed below; I could not comment on it), he is off by two levels if you remove both of UTF-8's main restrictions: the 4-byte limit, and the rule that codes 254 and 255 cannot be used (he only removed the 4-byte limit).
Start code 254 (11111110) follows the basic layout of start bits (multi-byte flag bit set to 1, a count of six 1s, and a terminal 0, with no spare bits), which gives you 6 additional bytes to work with (six 10xxxxxx groups, an additional 2^36 codes).
Start code 255 (11111111) does not exactly follow the basic layout: there is no terminal 0, but all bits are used, which gives you 7 additional bytes (multi-byte flag set to 1, a count of seven 1s, and no terminal 0 because all bits are used; seven 10xxxxxx groups, an additional 2^42 codes).
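As a rough sketch of the layout described here, the following Python derives the number of continuation bytes and payload bits from a start byte. The helper name `lead_byte_layout` is made up for illustration, and the handling of 254 and 255 follows this answer's hypothetical extension, not the UTF-8 standard.

```python
def lead_byte_layout(lead: int) -> tuple[int, int]:
    """Return (continuation_bytes, payload_bits) for a start byte.

    Hypothetical helper: it applies the extended scheme from this answer,
    so 254 and 255 are treated as usable start bytes (standard UTF-8
    rejects them).
    """
    if not 0 <= lead <= 0xFF:
        raise ValueError("lead must be a single byte value")
    # Count the run of 1 bits starting at the most significant bit.
    ones = 0
    for shift in range(7, -1, -1):
        if lead & (1 << shift):
            ones += 1
        else:
            break
    if ones == 0:
        return 0, 7  # 0xxxxxxx: single byte with 7 payload bits
    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a start byte")
    # The first 1 is the multi-byte flag; the remaining 1s are the count.
    continuation = ones - 1
    # Bits left in the start byte after the run of 1s and the terminal 0
    # (none at all for 254 and 255).
    lead_bits = max(0, 8 - ones - 1)
    return continuation, lead_bits + 6 * continuation


if __name__ == "__main__":
    for lead in (0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE, 0xFF):
        print(f"{lead:08b}", lead_byte_layout(lead))
```

Under these assumptions, start byte 254 comes out as 6 continuation bytes with 36 payload bits, and 255 as 7 continuation bytes with 42 payload bits.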
Adding these to the counts for 1- to 6-byte sequences (2^7 + 2^11 + 2^16 + 2^21 + 2^26 + 2^31) gives a final maximum representable character set of 4,468,982,745,216. This is more than all characters in current use, old or dead languages, and any verified lost languages. Angelic or Celestial script anyone?
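The arithmetic can be checked with a few lines of Python. This is only a sketch: the list names are illustrative, and it counts every bit pattern of each length (overlong forms included), which is how the figures in this answer appear to be derived.

```python
# Payload widths in bits for each sequence length.
STANDARD_BITS = [7, 11, 16, 21]  # 1 to 4 bytes (current UTF-8 limit)
LONG_FORM_BITS = [26, 31]        # 5 and 6 bytes (no 4-byte restriction)
EXTENDED_BITS = [36, 42]         # start codes 254 and 255 (hypothetical)


def total_codes(bit_widths):
    """Sum the number of bit patterns available at each width."""
    return sum(2 ** bits for bits in bit_widths)


four_byte_total = total_codes(STANDARD_BITS)
full_total = total_codes(STANDARD_BITS + LONG_FORM_BITS + EXTENDED_BITS)

print(f"{four_byte_total:,}")  # 2,164,864
print(f"{full_total:,}")       # 4,468,982,745,216
```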
There are also single-byte codes that are overlooked or ignored in the UTF-8 standard in addition to 254 and 255: 128-191 (which are valid only as continuation bytes, never on their own) and several others. Some of them are used locally by the keyboard; for example, code 128 is usually a deleting backspace. Other start codes (and their related ranges) are invalid for one or more reasons ( https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences ).
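To make the byte ranges concrete, here is a small Python sketch that labels each byte value by its role in standard UTF-8 as defined by RFC 3629. The function name `classify_byte` and the label strings are illustrative assumptions, not part of any library.

```python
def classify_byte(b: int) -> str:
    """Label a byte value by its role in standard (RFC 3629) UTF-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("b must be a single byte value")
    if b <= 0x7F:
        return "ascii"                 # 0-127: complete on their own
    if b <= 0xBF:
        return "continuation"          # 128-191: only valid after a start byte
    if b in (0xC0, 0xC1):
        return "invalid"               # could only begin overlong 2-byte forms
    if b <= 0xDF:
        return "start of 2-byte sequence"
    if b <= 0xEF:
        return "start of 3-byte sequence"
    if b <= 0xF4:
        return "start of 4-byte sequence"
    return "invalid"                   # 245-255: past U+10FFFF or never used


if __name__ == "__main__":
    for value in (0x41, 0x80, 0xBF, 0xC0, 0xC2, 0xF4, 0xF5, 0xFE, 0xFF):
        print(value, classify_byte(value))
```

Python's own decoder agrees with this for the bytes discussed here: `bytes([0xFE]).decode("utf-8")` raises `UnicodeDecodeError`, as does a lone `bytes([0x80])`.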
James V. Fields, Aug 14 '16 at 21:52