Is UTF-8 an injective mapping?

We are writing a C++ application and need to know the following:

Is the UTF-8 text encoding an injective mapping from bytes to characters, meaning that every single character (letter, ...) is encoded in exactly one way? So that, for example, the letter 'Ž' cannot be encoded as, say, both 3231 and 32119.

+6
5 answers

It depends a lot on what you consider a “letter”.

UTF-8 is basically just one small piece of the larger Unicode picture.

Basically, there are at least three levels: bytes, code points, and grapheme clusters. A code point may be encoded in one or more bytes according to a particular encoding such as UTF-8, UTF-16, or UTF-32, and that encoding is unique (all alternative byte sequences are declared invalid). However, a code point does not always correspond to a complete glyph, because there are so-called combining characters. A combining character follows a base character and, as the name says, is combined with it. For example, there is the combining character U+0308 COMBINING DIAERESIS, which puts a diaeresis (¨) above the preceding letter. So if it follows, say, an a (U+0061 LATIN SMALL LETTER A), the result is ä. However, there is also a single code point for the letter ä (U+00E4 LATIN SMALL LETTER A WITH DIAERESIS), which means that the code point sequence U+0061 U+0308 and the single code point U+00E4 describe the same letter.

Thus, each code point has a single valid UTF-8 encoding (for example, U+0061 is "\141", U+0308 is "\314\210", and U+00E4 is "\303\244"), but the letter ä can be encoded either as the code point sequence U+0061 U+0308, i.e. as the UTF-8 byte sequence "\141\314\210", or as the single code point U+00E4, i.e. as the byte sequence "\303\244".
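
To make this concrete, here is a minimal C++ sketch (plain byte-string literals, no libraries) showing that the two encodings of ä above are different byte sequences even though they describe the same letter:

  #include <cstdio>
  #include <cstring>

  int main() {
      // The letter ä, once as a + U+0308 (three bytes in UTF-8) ...
      const char decomposed[] = "\141\314\210"; // a, COMBINING DIAERESIS
      // ... and once as the precomposed code point U+00E4 (two bytes):
      const char composed[]   = "\303\244";     // LATIN SMALL LETTER A WITH DIAERESIS

      // A plain byte comparison sees two different strings:
      std::printf("%s\n", std::strcmp(decomposed, composed) != 0 ? "different" : "equal");
  }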

To make matters worse, the Unicode designers decided that combining characters follow the base character instead of preceding it, so you cannot know whether your glyph is complete until you see the next code point (if it is not a combining code point, your letter is complete).

+13

Valid UTF-8 does encode each character uniquely. However, there are so-called overlong sequences that conform to the general encoding scheme but are by definition invalid, since only the shortest sequence may be used to encode a character.

For example, there is a derivative of UTF-8 called Modified UTF-8 that encodes NUL as the overlong sequence 0xC0 0x80 instead of 0x00, to obtain an encoding that stays compatible with null-terminated strings.
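
As a minimal sketch of this (the function name is made up here, and only ASCII input is handled for brevity), Modified UTF-8 changes nothing except how U+0000 is written:

  #include <cstdio>
  #include <string>

  // Sketch: encode ASCII code points the Modified UTF-8 way, writing U+0000
  // as the overlong pair 0xC0 0x80 so the result never contains a zero byte
  // (strict UTF-8 would emit 0x00 instead).
  std::string modified_utf8_ascii(const std::u32string &cps) {
      std::string out;
      for (char32_t cp : cps) {
          if (cp == 0)
              out += "\xC0\x80";            // overlong NUL, invalid in strict UTF-8
          else
              out += static_cast<char>(cp); // assumes cp < 0x80 for brevity
      }
      return out;
  }

  int main() {
      std::u32string cps{U'a', U'\0', U'b'};
      for (unsigned char c : modified_utf8_ascii(cps))
          std::printf("0x%02X ", c);        // prints 0x61 0xC0 0x80 0x62
      std::printf("\n");
  }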

If you are asking about grapheme clusters (i.e. user-perceived characters) instead of characters, then even valid UTF-8 is ambiguous. However, Unicode defines several normalization forms, and if you restrict yourself to normalized strings, then UTF-8 really is injective.

A little off topic: here is some ASCII art I came up with to visualize the various character concepts. From top to bottom it separates the human, abstract, and machine levels. Feel free to come up with better names...

            [user-perceived characters]<-+
                            ^            |
                            |            |
                            v            |
    [characters] <-> [grapheme clusters] |
       ^                ^                |
       |                |                |
       v                v                |
    [bytes] <-> [codepoints]  [glyphs]<--+

To return to the topic: this diagram also shows where potential problems can arise when using bytes to compare abstract strings. In particular (assuming UTF-8), the programmer must make sure that

  • the byte sequence is valid, i.e. it contains no overlong sequences and does not encode invalid code points (see the validation sketch after this list)
  • the code point sequence is normalized, so that equivalent grapheme clusters have a unique representation
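
Here is a sketch of the first bullet: a structural validity check over a byte buffer that rejects overlong sequences, surrogate code points, and out-of-range values. The function name is mine; production code would more likely rely on an existing library:

  #include <cstddef>
  #include <cstdint>

  bool is_valid_utf8(const std::uint8_t *p, std::size_t n) {
      for (std::size_t i = 0; i < n; ) {
          std::uint8_t b = p[i];
          std::size_t len;
          std::uint32_t cp, min;
          if (b < 0x80) { i += 1; continue; }                              // ASCII
          else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
          else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
          else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
          else return false;              // stray continuation byte or invalid lead
          if (i + len > n) return false;  // truncated sequence
          for (std::size_t k = 1; k < len; ++k) {
              if ((p[i + k] & 0xC0) != 0x80) return false; // not a continuation byte
              cp = (cp << 6) | (p[i + k] & 0x3F);
          }
          if (cp < min) return false;                      // overlong encoding
          if (cp > 0x10FFFF) return false;                 // beyond Unicode range
          if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
          i += len;
      }
      return true;
  }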
+6

First, some terminology:

  • Letter: (an abstract concept, not defined by Unicode) some letter or symbol that you want to represent.
  • Code point: a number assigned by Unicode to a character.
  • Grapheme cluster: a sequence of Unicode code points that corresponds to a single letter, for example a + ́ (a combining accent) for the letter á.
  • Glyph: (a font-level concept, not part of Unicode) a graphical representation of a letter.

Each code point (for example, U+1F4A9) receives a unique representation as bytes in UTF-8 (in this case: 0xF0 0x9F 0x92 0xA9).
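
That mapping is fixed by the encoding rules (RFC 3629). Here is a short C++ sketch of those rules, checked against U+1F4A9; error handling for surrogates and out-of-range values is omitted for brevity:

  #include <cstdio>
  #include <string>

  // Encode one code point as UTF-8: 1 to 4 bytes depending on its magnitude.
  std::string encode_utf8(char32_t cp) {
      std::string out;
      if (cp < 0x80) {
          out += static_cast<char>(cp);
      } else if (cp < 0x800) {
          out += static_cast<char>(0xC0 | (cp >> 6));
          out += static_cast<char>(0x80 | (cp & 0x3F));
      } else if (cp < 0x10000) {
          out += static_cast<char>(0xE0 | (cp >> 12));
          out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
          out += static_cast<char>(0x80 | (cp & 0x3F));
      } else {
          out += static_cast<char>(0xF0 | (cp >> 18));
          out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
          out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
          out += static_cast<char>(0x80 | (cp & 0x3F));
      }
      return out;
  }

  int main() {
      for (unsigned char c : encode_utf8(U'\U0001F4A9'))
          std::printf("0x%02X ", c);   // prints 0xF0 0x9F 0x92 0xA9
      std::printf("\n");
  }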

Some letters can be represented in several different ways as code points (i.e., as different grapheme clusters). For example, á can be represented as the single code point á (U+00E1 LATIN SMALL LETTER A WITH ACUTE), or as the code point for a (U+0061 LATIN SMALL LETTER A) plus the code point for ́ (U+0301 COMBINING ACUTE ACCENT). Unicode has several canonical normalization forms to deal with this (for example, NFC, the canonical composition form, uses fewer code points, while NFD is fully decomposed).
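
In a C++ application this is usually delegated to a library. Here is a sketch using ICU, assuming ICU4C is installed (link with -licuuc): normalizing both spellings of á to NFC makes them compare equal:

  #include <unicode/normalizer2.h>
  #include <unicode/unistr.h>
  #include <cstdio>

  int main() {
      UErrorCode status = U_ZERO_ERROR;
      const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);

      // á as the single code point U+00E1, and as a + U+0301:
      icu::UnicodeString composed   = icu::UnicodeString::fromUTF8("\xC3\xA1");
      icu::UnicodeString decomposed = icu::UnicodeString::fromUTF8("a\xCC\x81");

      std::printf("raw:        %s\n", composed == decomposed ? "equal" : "different");
      std::printf("normalized: %s\n",
                  nfc->normalize(composed, status) == nfc->normalize(decomposed, status)
                      ? "equal" : "different");
  }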

And then there are also ligatures, as well as some other presentation-related variants (for example: superscripts, non-breaking variants, letters that take different shapes at different positions within a word, ...). Some of these are in Unicode to allow lossless round-trip conversion from legacy character sets. Unicode has compatibility normalization forms (NFKC and NFKD) to handle them.

+3

Yes. UTF-8 is a standard Unicode character encoding. It was designed so that there is only one way to encode each Unicode character.

A little off topic: it may be useful to know that some characters look very similar (to humans) but are still different. For example, there is a character in the Cyrillic alphabet that looks very similar to '/'.

+2

Yes, sort of. When used correctly, each Unicode code point can be encoded in only one way in UTF-8, but this holds partly because of the requirement that only the shortest applicable UTF-8 byte sequence may be used for any character.

However, the encoding scheme itself could represent many characters in more than one way if it were not for this requirement, and although such encodings are invalid, there are cases where they are produced in practice.

For example, 'Z' can be encoded as 0x5A or as {0xC1, 0x9A} (among others), although the single byte 0x5A is the only correct form because it is the shortest.
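
A quick arithmetic check of that claim, as a sketch: decoding the two-byte pattern without the shortest-form rule lands on the same code point as the single byte 0x5A, which is exactly why strict UTF-8 forbids it.

  #include <cassert>

  int main() {
      // 0xC1 0x9A matches the two-byte pattern 110xxxxx 10yyyyyy;
      // stitching the payload bits together yields 0x5A, i.e. 'Z':
      unsigned cp = ((0xC1u & 0x1Fu) << 6) | (0x9Au & 0x3Fu);
      assert(cp == 0x5A);
      // A strict decoder rejects the pair: two-byte sequences must encode
      // code points >= 0x80, so 0xC1 (and 0xC0) can never be a valid lead byte.
  }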

0

Source: https://habr.com/ru/post/901394/

