Spec justification wanted for browser handling of &#x80; to &#x9F; in UTF-8 documents

The HTML 4.01 spec says this about hexadecimal character references:

Numeric character references specify the code position of a character in the document character set.

So, if the document's character encoding is UTF-8, a numeric reference must specify a Unicode code point.

The HTML5 spec says this about hexadecimal character references:

The ampersand must be followed by a U+0023 NUMBER SIGN character (#), which must be followed by either a U+0078 LATIN SMALL LETTER X character (x) or a U+0058 LATIN CAPITAL LETTER X character (X), which must then be followed by one or more digits in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F, representing a base-sixteen integer that corresponds to a Unicode code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).

Here the document character set isn't mentioned; the spec just says that the numeric value identifies a Unicode code point.

But it seems that all modern browsers (I haven't tested older ones) treat &#x80; through &#x9F; as if they were Windows-1252 references.

For example, &#x80; displays €, but U+0080 isn't the code point for €; U+20AC is. U+0080 itself is defined as the control character PAD.

&#x20AC; also (correctly) displays €.

Is this just pragmatic behavior on the part of browsers, or is there a justification in the spec that I'm missing?

[Note that decimal references behave the same way. I just used hexadecimal for clarity and consistency.]
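
The behavior can be reproduced outside a browser. A minimal sketch, assuming Python 3's standard `html.unescape`, which is documented as following the HTML5 rules for numeric character references (the variable names are only illustrative):

```python
import html

# Hexadecimal and decimal references to 0x80 (decimal 128).
hex_result = html.unescape("&#x80;")
dec_result = html.unescape("&#128;")

# The reference that names the euro sign's real code point.
explicit = html.unescape("&#x20AC;")

# All three resolve to U+20AC rather than to the control character U+0080.
print(hex(ord(hex_result)))  # 0x20ac
print(hex(ord(dec_result)))  # 0x20ac
print(hex(ord(explicit)))    # 0x20ac
```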

+4
2 answers

I found the answer to my own question. It's in the tokenization section of the HTML5 parsing algorithm: the steps for consuming a character reference define the mapping for these characters.
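
The effect of that mapping on the 80-9F range can be sketched in a few lines. This is only an illustration, not the spec's algorithm: `resolve_numeric_reference` is a hypothetical helper, it leans on Python's `cp1252` codec instead of the spec's literal table, and it ignores the parser's other special cases (such as 0, surrogates, and values beyond U+10FFFF).

```python
def resolve_numeric_reference(code_point: int) -> str:
    """Hypothetical helper: turn a numeric reference value into a character,
    treating 0x80-0x9F the way a Windows-1252 byte would be treated."""
    if 0x80 <= code_point <= 0x9F:
        try:
            # Reinterpret the value as a single Windows-1252 byte.
            return bytes([code_point]).decode("cp1252")
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in Windows-1252,
            # so the value is left as the C1 control character it names.
            return chr(code_point)
    # Everything else is taken as the Unicode code point itself.
    return chr(code_point)


print(resolve_numeric_reference(0x80) == "\u20ac")    # True: euro sign
print(resolve_numeric_reference(0x99) == "\u2122")    # True: trade mark sign
print(resolve_numeric_reference(0x20AC) == "\u20ac")  # True: unchanged
```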

+5

As I did here, I'll quote Wikipedia again:

Numeric references always refer to Unicode code points, regardless of the page's encoding. Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00-08, 0B-0C, 0E-1F, 7F, and 80-9F cannot be used in an HTML document, not even by reference, so &#153;, for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80-9F range are interpreted by some browsers as representing the characters mapped to bytes 80-9F in the Windows-1252 encoding.

So this seems to be a legacy issue.

+3

Source: https://habr.com/ru/post/1387747/

