Is the Unicode Consortium going to make UTF-16 run out of code points?

The current version of UTF-16 is only capable of encoding 1,112,064 different numbers (code points): 0x0 through 0x10FFFF.
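(For the record, a quick check of that figure: it is the 0x110000 values in the range, minus the 2,048 UTF-16 surrogates, which can never stand for characters on their own.)

```python
total = 0x10FFFF + 1               # 1,114,112 values in 0x0-0x10FFFF
surrogates = 0xDFFF - 0xD800 + 1   # 2,048 reserved for UTF-16 itself
print(total - surrogates)          # 1112064
```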

Does the Unicode Consortium intend to make UTF-16 run out of characters?

i.e., assign code points > 0x10FFFF?

If not, why would anyone write a UTF-8 parser that can accept 5- or 6-byte sequences? That would add unnecessary instructions to their function.

Is 1,112,064 enough? Do we really need more characters? I mean: how quickly are we likely to run out?

+4
4 answers

As of 2011, we have consumed 109,449 characters and set aside another 137,468 for private use (6,400 + 131,068),

leaving room for over 860,000 unused code points; plenty for CJK Extension E (~10,000 characters) and another 85 sets like it. So should we ever make contact with Ferengi culture, we are prepared.
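A quick back-of-the-envelope check of those figures (the counts are the 2011 numbers quoted above, not current ones):

```python
capacity = 1_112_064                  # UTF-16's total code point capacity
used = 109_449 + 6_400 + 131_068      # assigned + both private-use blocks
remaining = capacity - used
print(remaining)                      # 865147 -> "over 860,000"
print(remaining // 10_000)            # ~86 CJK-Extension-E-sized sets
```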

In November 2003, the IETF restricted UTF-8 via RFC 3629 to end at U+10FFFF, to match the constraints of the UTF-16 encoding: a UTF-8 parser should not accept 5- or 6-byte sequences that would overflow the UTF-16 range, nor 4-byte sequences that decode to anything greater than 0x10FFFF.
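A minimal sketch of what such a check looks like (the function name is made up for illustration; surrogate and overlong-form rejection are elided to keep it short):

```python
def decode_code_point(buf: bytes, i: int) -> tuple[int, int]:
    """Decode one code point of buf starting at index i; return (cp, next_i).

    Enforces the RFC 3629 limits described above: lead bytes that would
    start 5- or 6-byte sequences are rejected, as is any 4-byte
    sequence that decodes to a value above U+10FFFF.
    """
    b0 = buf[i]
    if b0 < 0x80:                       # 1-byte (ASCII)
        return b0, i + 1
    if 0xC2 <= b0 <= 0xDF:              # 2-byte lead
        n, cp = 2, b0 & 0x1F
    elif 0xE0 <= b0 <= 0xEF:            # 3-byte lead
        n, cp = 3, b0 & 0x0F
    elif 0xF0 <= b0 <= 0xF4:            # 4-byte lead, capped at U+10FFFF
        n, cp = 4, b0 & 0x07
    else:
        # Stray continuation bytes, overlong leads 0xC0/0xC1, and
        # 0xF5-0xFD (values past U+10FFFF and the old 5/6-byte forms).
        raise ValueError(f"invalid lead byte 0x{b0:02X}")
    if i + n > len(buf):
        raise ValueError("truncated sequence")
    for b in buf[i + 1:i + n]:
        if b & 0xC0 != 0x80:            # must be a continuation byte
            raise ValueError("malformed continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp > 0x10FFFF:                   # e.g. 0xF4 0x90 0x80 0x80
        raise ValueError(f"U+{cp:X} is above U+10FFFF")
    return cp, i + n

print(decode_code_point(b"\xF0\x9F\x98\x80", 0))  # (128512, 4) -> U+1F600
```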

Please post lists of character sets that would pose a threat to the Unicode code point limit, should any of them exceed a third of the size of CJK Extension E (~10,000 characters).

+4

Currently, the Unicode standard does not define any characters above U+10FFFF, so you are fine to code your application to reject characters above that point.
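A guard of that sort might look like this (a sketch; the function name is hypothetical):

```python
def is_valid_scalar(cp: int) -> bool:
    # Reject anything above U+10FFFF, plus the UTF-16 surrogate range,
    # which likewise never denotes a character.
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

print(is_valid_scalar(0x10FFFF))   # True: last valid code point
print(is_valid_scalar(0x110000))   # False: above the Unicode limit
```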

It is difficult to predict the future, but I think you are safe with this strategy for the foreseeable future. Honestly, even if Unicode extends beyond U+10FFFF in the distant future, it will almost certainly not be for mission-critical characters. Your application may not be compatible with the new Ferengi fonts released in 2063, but you can always fix it when it actually becomes a problem.

+1

Cutting to the chase:

Yes, it is safe to assume the encoding scheme only needs to support code points up to U+10FFFF.

No, it does not seem that there is a real risk of code points being exhausted at any point soon.

+1

There is no reason to write a UTF-8 parser that supports 5- and 6-byte sequences, except to interoperate with legacy systems that actually used them. The current official UTF-8 specification does not permit 5- and 6-byte sequences, precisely to guarantee lossless round-trip conversion to and from UTF-16. If Unicode ever adds support for code points above U+10FFFF, there will be plenty of time to develop new encoding formats for the higher values. Or maybe by the time that happens, memory and processing power will be sufficient that everyone will just switch to UTF-32 for everything, which could handle up to 4 billion characters, up to U+FFFFFFFF.
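For reference, these are the sequence lengths the pre-2003 UTF-8 definition (RFC 2279) allowed, which is what "legacy" 5- and 6-byte support refers to:

```python
# Sequence lengths in pre-2003 UTF-8 (RFC 2279), for reference only;
# RFC 3629 removed the 5- and 6-byte forms and capped 4-byte
# sequences at U+10FFFF.
RFC2279_RANGES = [
    (1, 0x000000, 0x0000007F),
    (2, 0x000080, 0x000007FF),
    (3, 0x000800, 0x0000FFFF),
    (4, 0x010000, 0x001FFFFF),   # RFC 3629 caps this at 0x10FFFF
    (5, 0x200000, 0x03FFFFFF),   # removed by RFC 3629
    (6, 0x4000000, 0x7FFFFFFF),  # removed by RFC 3629
]
for nbytes, lo, hi in RFC2279_RANGES:
    print(f"{nbytes}-byte: U+{lo:04X}..U+{hi:04X}")
```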

0

Source: https://habr.com/ru/post/1397662/

