Who performs Unicode normalization and when?

Question

"JavaScript: The Definitive Guide" says:

JavaScript assumes that the source code that it interprets is already normalized and does not attempt to normalize identifiers, strings, or regular expressions.
The Unicode standard defines the preferred encoding for all characters and defines a normalization procedure for converting text to a canonical form suitable for comparisons.

If JS does not normalize Unicode, then who does it and when?

If JavaScript doesn't normalize Unicode, then why is

 "café" === "caf\u00e9" // => true

and why is

 "café" === "cafe\u0301" // => false

when both (\u00e9 and e\u0301) are valid Unicode ways of forming é?


javascript unicode

Harshit Juneja Jul 23 '17 at 18:43


1 answer

spectras · Accepted Answer · 2017-07-23T18:56:49+0000

You are confusing Unicode normalization and string escaping.

 "café"

... is a string consisting of characters with code points 0x63, 0x61, 0x66, 0xe9.

You can get exactly the same string using an escaped representation:

 "caf\u00e9"
 // or even
 "\u0063\u0061\u0066\u00e9"
 // or why not
 "\u0063\u0061fé"

When parsing such a string literal, JavaScript unescapes it. That is, it replaces each escape sequence with the corresponding character. This is the same process that replaces "\n" with a newline.
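As a rough illustration of that unescaping step, the sketch below builds the same four characters three ways and lists their code points. The helper name `codePoints` is made up for this example, and only escaped spellings are used so the result does not depend on how this page's text was encoded.

```javascript
// Hypothetical helper: list the code points of a string as hex strings.
const codePoints = (s) =>
  Array.from(s, (ch) => ch.codePointAt(0).toString(16));

// Three spellings of the same string. The escapes are resolved when the
// literal is parsed, long before any comparison takes place.
const literal = "caf\u00e9";
const escaped = "\u0063\u0061\u0066\u00e9";
const built = String.fromCodePoint(0x63, 0x61, 0x66, 0xe9);

console.log(codePoints(literal)); // [ '63', '61', '66', 'e9' ]
console.log(literal === escaped); // true
console.log(literal === built);   // true
```

No normalization is involved here: the three values are identical code point sequences from the start.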

Now, your second example is actually a different string, because it is not in normalized form. It consists of the characters 0x63, 0x61, 0x66, 0x65, 0x301. Since no normalization takes place, it is not the same string.

Now let's try the same thing with the decomposed sequence itself, which you probably cannot type on your keyboard but which I will paste here: "café". Check it out:

 > a = "café" // this one is copy-pasted with the combining acute
 > b = "café" // this one is typed using the "é" key on my keyboard
 > a === "cafe\u0301"
 <- true
 > b === "cafe\u0301"
 <- false
 > a === "caf\u00e9"
 <- false
 > b === "caf\u00e9"
 <- true
 > a === b
 <- false
 // Now just making sure...
 > a.length
 <- 5
 > b.length
 <- 4

The fact that café and café are displayed the same does not make them the same string. JavaScript compares the strings, sees that 0x63, 0x61, 0x66, 0xe9 does not match 0x63, 0x61, 0x66, 0x65, 0x301, and returns false.
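If you do want the two spellings to compare equal, normalization has to be requested explicitly. One way, assuming an ES2015 or later engine, is the built-in String.prototype.normalize. The sketch below is only an illustration; `equalNormalized` is a made-up helper name, not something from the question or the book.

```javascript
// Two spellings of "café", written with escapes so the source encoding
// of this snippet cannot change the result.
const precomposed = "caf\u00e9"; // single code point for é
const decomposed = "cafe\u0301"; // "e" followed by a combining acute accent

// Hypothetical helper: compare two strings after converting both to NFC.
function equalNormalized(x, y) {
  return x.normalize("NFC") === y.normalize("NFC");
}

console.log(precomposed === decomposed);               // false
console.log(equalNormalized(precomposed, decomposed)); // true
console.log(decomposed.normalize("NFC").length);       // 4
console.log(precomposed.normalize("NFD").length);      // 5
```

NFC is used here because it yields the shorter, precomposed form; NFD would work equally well for the comparison as long as both sides use the same form.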
