How does decoding in UTF-8 know byte boundaries?

I've done a fair amount of reading about Unicode, especially as it relates to Python. I think I have a pretty solid understanding now, but there is one small detail I'm still a little unsure about.

How does decoding know where the byte boundaries are? For example, say I have a Unicode string containing two Unicode characters whose UTF-8 byte representations are \xc6\xb4 and \xe2\x98\x82, respectively. I write this string to a file, so the file now contains the bytes \xc6\xb4\xe2\x98\x82. Later I open and read the file (and Python decodes the file as UTF-8 by default), which leads me to my main question.

How does the decoder know to interpret the bytes as \xc6\xb4 rather than \xc6\xb4\xe2?
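
For reference, here is a minimal sketch of the scenario I mean. The characters U+01B4 and U+2602 match the byte sequences above; the file name is just an example:

```python
# Two characters: "ƴ" (U+01B4, encoded as \xc6\xb4) and "☂" (U+2602,
# encoded as \xe2\x98\x82), written to a file and read back as UTF-8.
s = "\u01b4\u2602"
print(s.encode("utf-8"))          # b'\xc6\xb4\xe2\x98\x82'

with open("chars.txt", "w", encoding="utf-8") as f:
    f.write(s)

with open("chars.txt", encoding="utf-8") as f:
    print(f.read())               # the original two characters come back
```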

1 answer

Byte boundaries are easily determined from the bit patterns. In your case, \xc6 starts with the bits 1100, while \xe2 starts with 1110. In UTF-8 (and I'm sure this is no accident), you can determine the number of bytes in the whole character by looking only at the first byte and counting the number of 1 bits before the first 0. So your first character occupies 2 bytes and the second occupies 3 bytes.

If a byte begins with 0, it is a regular ASCII character.

If a byte begins with 10, it is a continuation byte of a multi-byte UTF-8 sequence (not the first byte).
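
A rough sketch of how these rules let a decoder find the boundaries (not a full validator, just the lead-byte logic applied to your bytes):

```python
def sequence_length(lead_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0b10000000:      # 0xxxxxxx: single-byte ASCII character
        return 1
    if lead_byte < 0b11000000:      # 10xxxxxx: continuation byte, never a start
        raise ValueError("continuation byte, not the start of a character")
    if lead_byte < 0b11100000:      # 110xxxxx: start of a 2-byte sequence
        return 2
    if lead_byte < 0b11110000:      # 1110xxxx: start of a 3-byte sequence
        return 3
    return 4                        # 11110xxx: start of a 4-byte sequence

data = b"\xc6\xb4\xe2\x98\x82"
i = 0
while i < len(data):
    n = sequence_length(data[i])
    print(data[i:i + n].decode("utf-8"), "uses", n, "bytes")
    i += n
```

Running this prints the 2-byte and 3-byte characters separately, which is exactly how the decoder knows to stop after \xc6\xb4 instead of consuming \xe2.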


Source: https://habr.com/ru/post/1543813/

