I've done a lot of reading on Unicode, especially as it relates to Python. I think I have a pretty strong understanding now, but there's one small detail I'm still a little unsure about.
How does decoding know the byte boundaries? For example, let's say I have a Unicode string containing two characters whose UTF-8 byte representations are \xc6\xb4 and \xe2\x98\x82, respectively. I then write this string to a file, so the file now contains the bytes
\xc6\xb4\xe2\x98\x82. Later I open and read the file (and Python decodes the file as UTF-8 by default), which leads me to my main question.
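To make the scenario concrete, here is roughly what I mean (the filename is just for illustration):

    # Two characters whose UTF-8 encodings are \xc6\xb4 and \xe2\x98\x82
    text = "\u01b4\u2602"  # 'ƴ' and '☂'

    with open("example.txt", "w", encoding="utf-8") as f:
        f.write(text)

    # The raw bytes on disk are the two sequences back to back
    with open("example.txt", "rb") as f:
        print(f.read())    # b'\xc6\xb4\xe2\x98\x82'

    # Reading it back as text recovers both characters
    with open("example.txt", encoding="utf-8") as f:
        print(f.read())    # 'ƴ☂'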
How does decoding know to interpret the bytes as \xc6\xb4 rather than \xc6\xb4\xe2?
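My guess is that the decoder looks at the lead byte first and uses its high bits to decide how many continuation bytes belong to the character. A small sketch of that idea (not Python's actual C implementation, just how I picture it working):

    # How a UTF-8 decoder can find character boundaries:
    # the high bits of each lead byte announce the sequence length.
    #   0xxxxxxx -> 1-byte sequence (ASCII)
    #   110xxxxx -> 2-byte sequence
    #   1110xxxx -> 3-byte sequence
    #   11110xxx -> 4-byte sequence
    # Continuation bytes always look like 10xxxxxx, so they can
    # never be mistaken for the start of a character.

    data = b"\xc6\xb4\xe2\x98\x82"

    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx
            length = 1
        elif lead >> 5 == 0b110:     # 110xxxxx
            length = 2
        elif lead >> 4 == 0b1110:    # 1110xxxx
            length = 3
        elif lead >> 3 == 0b11110:   # 11110xxx
            length = 4
        else:
            raise ValueError("invalid lead byte")
        chunk = data[i:i + length]
        print(chunk, "->", chunk.decode("utf-8"))
        i += length

Running this prints b'\xc6\xb4' -> ƴ and then b'\xe2\x98\x82' -> ☂, so \xc6 (binary 110 00110) announces a 2-byte character, which is why the decoder stops before \xe2. Is that the right mental model?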