Does Unicode have a certain maximum number of code points?

I have read many articles trying to find out the maximum number of Unicode code points, but I did not find a definitive answer.

I understand that the range of Unicode code points was deliberately limited so that the UTF-8, UTF-16 and UTF-32 encodings can all handle the same set of code points. But what is that number of code points?

The most common answer I came across is that Unicode code points are in the range 0x000000 to 0x10FFFF (1,114,112 code points), but in other places I have read that there are 1,112,114 code points. So is there one definitive number, or is the problem more complicated?

3 answers

The maximum allowed code point in Unicode is U+10FFFF, which makes it a 21-bit code set (but not all 21-bit integers are valid Unicode code points; in particular, values from 0x110000 to 0x1FFFFF are not valid Unicode code points).
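As a quick sanity check of the 21-bit claim, you can verify it in plain JavaScript (nothing Unicode-specific, just Number.prototype.toString):

console.log((0x10FFFF).toString(2));        // "100001111111111111111"
console.log((0x10FFFF).toString(2).length); // 21, hence "a 21-bit code set"
console.log(0x10FFFF + 1);                  // 1114112 possible code points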

This is where the number 1,114,112 comes from: U+0000 .. U+10FFFF is 1,114,112 values.

However, there is also a set of code points that are the surrogates for UTF-16. They are in the range U+D800 .. U+DFFF. These are 2,048 code points that are reserved for UTF-16.

1,114,112 - 2,048 = 1,112,064

There are also 66 noncharacters, partially defined in Corrigendum #9: 34 values of the form U+nFFFE and U+nFFFF (the last two code points of each of the 17 planes, i.e. n = 0x0 .. 0x10), and the 32 values U+FDD0 .. U+FDEF. Subtracting those as well, we get 1,111,998 code points that can be allocated to characters. Three ranges are reserved for private use: U+E000 .. U+F8FF, U+F0000 .. U+FFFFD and U+100000 .. U+10FFFD. The number of actually assigned characters depends on the version of Unicode you are looking at. You can find information about the latest version on the Unicode Consortium site. Among other things, the introduction says:

The Unicode Standard, Version 7.0, contains 112,956 characters

Thus, only about 10% of the available code points have been assigned so far.
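To make the arithmetic easy to reproduce, here is a short JavaScript sketch; the only input that does not follow from the ranges listed above is the 112,956 figure quoted from the Unicode 7.0 introduction:

// Reproducing the counts from this answer.
const totalCodePoints = 0x10FFFF + 1;     // U+0000 .. U+10FFFF  -> 1,114,112
const surrogates = 0xDFFF - 0xD800 + 1;   // U+D800 .. U+DFFF    ->     2,048
const noncharacters = 34 + 32;            // U+nFFFE / U+nFFFF plus U+FDD0 .. U+FDEF

console.log(totalCodePoints - surrogates);                  // 1112064 scalar values
console.log(totalCodePoints - surrogates - noncharacters);  // 1111998 assignable code points

const assignedInV7 = 112956;              // figure quoted from the Unicode 7.0 introduction
console.log((100 * assignedInV7 / (totalCodePoints - surrogates - noncharacters)).toFixed(1) + '%');  // "10.2%"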

I can’t explain why you found 1,112,114 as the number of code points.

Incidentally, the upper limit of U+10FFFF was chosen so that every value in Unicode can be represented in one or two 2-byte code units in UTF-16, using one high surrogate and one low surrogate to represent values outside the BMP, or Basic Multilingual Plane, which is the range U+0000 .. U+FFFF.
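For illustration, here is a minimal sketch of that mapping (the function name toSurrogatePair is my own, not a standard API): a supplementary code point has its 0x10000 offset removed, and the remaining 20 bits are split across a high and a low surrogate.

// Splits a code point above the BMP (U+10000 .. U+10FFFF) into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
    if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
        throw new RangeError('expected a supplementary code point');
    }
    const offset = codePoint - 0x10000;       // 20 bits remain
    const high = 0xD800 + (offset >> 10);     // top 10 bits    -> U+D800 .. U+DBFF
    const low  = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits -> U+DC00 .. U+DFFF
    return [high, low];
}

console.log(toSurrogatePair(0x10FFFF).map(n => n.toString(16)));  // ['dbff', 'dfff']
console.log(toSurrogatePair(0x1F600).map(n => n.toString(16)));   // ['d83d', 'de00'] (the 😀 emoji)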


Yes, all code points that cannot be represented in UTF-16 (even with the use of surrogate pairs) have been declared invalid.

U+10FFFF is the highest code point, but noncharacters such as U+FFFE and U+FFFF are not usable code points, so the total number of usable code points is slightly lower.


I made a very small routine that displays a long table of char codes, from a configurable start value to start + range, where start is a number the user can set. This is a snippet:

// Renders a table of char codes from 'start' to 'start + range',
// showing what String.fromCharCode and String.fromCodePoint return for each value.
function getVal() {
  var start = parseInt(document.getElementById('start').value);
  var range = parseInt(document.getElementById('range').value);
  var end = start + range;
  return [start, range, end];
}

function next() {
  var values = getVal();
  document.getElementById('start').value = values[2];
  document.getElementById('ok').click();
}

function prev() {
  var values = getVal();
  document.getElementById('start').value = values[0] - values[1];
  document.getElementById('ok').click();
}

function renderCharCodeTable() {
  var values = getVal();
  var start = values[0];
  var end = values[2];
  const MINSTART = 0;          // Allowed range
  const MAXEND = 4294967294;   // Allowed range
  start = start < MINSTART ? MINSTART : start;
  end = end < MINSTART ? (MINSTART + 1) : end;
  start = start > MAXEND ? (MAXEND - 1) : start;
  end = end >= MAXEND ? (MAXEND + 1) : end;
  var tr = [];
  var unicodeCharSet = document.getElementById('unicodeCharSet');
  var cCode;
  var cPoint;
  for (var c = start; c < end; c++) {
    try {
      cCode = String.fromCharCode(c);
    } catch (e) {
      cCode = 'fromCharCode max val exceeded';
    }
    try {
      cPoint = String.fromCodePoint(c);
    } catch (e) {
      cPoint = 'fromCodePoint max val exceeded';
    }
    // Collect rows contiguously (avoids building a huge sparse array for large start values).
    tr.push('<tr><td>' + c + '</td><td>' + cCode + '</td><td>' + cPoint + '</td></tr>');
  }
  unicodeCharSet.innerHTML = tr.join('');
}

function startRender() {
  setTimeout(renderCharCodeTable, 100);
  console.time('renderCharCodeTable');
}

// Render the initial table once the page has loaded.
window.addEventListener("load", startRender);
body { margin-bottom: 50%; }
form { position: fixed; }
table * { border: 1px solid black; font-size: 1em; text-align: center; }
table { margin: auto; border-collapse: collapse; }
td:hover { padding-bottom: 1.5em; padding-top: 1.5em; }
tbody > tr:hover { font-size: 5em; }
<form>
  Start Unicode:
  <input type="number" id="start" value="0" onchange="renderCharCodeTable()" min="0" max="4294967300" title="Set a number from 0 to 4294967294">
  <p></p>
  Show
  <input type="number" id="range" value="30" onchange="renderCharCodeTable()" min="1" max="1000" title="Range to show. Insert a value from 10 to 1000">
  symbols at once.
  <p></p>
  <input type="button" id="pr" value="◄◄" onclick="prev()" title="Show previous">
  <input type="button" id="nx" value="►►" onclick="next()" title="Show next">
  <input type="button" id="ok" value="OK" onclick="startRender()" title="Ok">
  <input type="reset" id="rst" value="X" onclick="startRender()" title="Reset">
</form>
<table>
  <thead>
    <tr>
      <th>CODE</th>
      <th>Symbol fromCharCode</th>
      <th>Symbol fromCodePoint</th>
    </tr>
  </thead>
  <tbody id="unicodeCharSet">
    <tr><td colspan="3">Rendering...</td></tr>
  </tbody>
</table>

Run it once, then open the code and set the start value to a very large number, slightly less than MAXEND. Here is what I got:

code         equivalent symbol
{~~~ first execution output example ~~~~~}
0 .. 32      (control characters and space: no visible symbol)
33           !
34           "
35           #
36           $
37           %
38           &
39           '
40           (
41           )
42           *
43           +
44           ,
45           -
46           .
47           /
48           0
...
57           9
{~~~ second execution output example ~~~~~}
4294967275   →
4294967276   ↓
4294967277   ■
4294967278   ○
4294967279   ￯
4294967280   ￰
...
4294967293   (blank)
4294967294   (blank)

The output is of course truncated (between the first and second execution) because it is too long.

After 4294967294 (= 2^32 - 2, the MAXEND limit in the snippet) the function inexorably stops, so I assume it has reached its maximum possible value: I therefore interpret this as the maximum possible value of the char code table. Of course, as the other answers say, not every code has an assigned character; many of them render blank, as the example shows. There are also many characters that are repeated multiple times at different points between char codes 0 and 4294967294.

Edit: improvements

(thanks @duskwuff)

Now you can also compare the behaviour of String.fromCharCode and String.fromCodePoint. Note that the former accepts values up to 4294967294, but its results repeat every 65536 codes (2^16, because it truncates the argument to 16 bits). The latter stops working after code 1114111 (0x10FFFF): since code points start at 0, that gives a total of 1,114,112 Unicode code points, though as the other answers say, not all of them are valid and many are unassigned. Also remember that in order to see a specific Unicode character you need a font that defines a glyph for it; otherwise you will see a blank or an empty square (tofu) character.
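For example, this short sketch shows both behaviours (the results are what any standard ES2015+ engine should return):

console.log(String.fromCharCode(0x41));                    // "A"
console.log(String.fromCharCode(0x41 + 0x10000) === "A");  // true: the argument is truncated to 16 bits

console.log(String.fromCodePoint(0x10FFFF).length);        // 2: encoded as a surrogate pair
try {
  String.fromCodePoint(0x110000);                          // first value past U+10FFFF
} catch (e) {
  console.log(e instanceof RangeError);                    // true: outside the Unicode range
}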


Note:

I noticed that on some Android systems, the Chrome browser for Android throws an error from String.fromCodePoint for all code points.

