How does UTF-8 encoding identify single-byte and double-byte characters?

I recently ran into a character encoding problem while reading about character sets and encodings, and this question occurred to me. UTF-8 is the most popular encoding because of its backward compatibility with ASCII. Since UTF-8 is a variable-length encoding, how does it distinguish between single-byte and double-byte characters? For example, “A ݔ” is stored as “410754” (the Unicode code point for A is 41, and for the Arabic character it is 0754). How does the decoder identify that 41 is one single-byte character and 0754 is another, double-byte character? Why is it not read as 4107 being one double-byte character and 54 a single-byte character?

+16
3 answers

For example, "A ݔ" is saved as "410754"

This is not how UTF-8 works.

Characters U+0000 through U+007F (aka ASCII) are stored as single bytes. They are the only characters whose code points numerically match their UTF-8 representation. For example, U+0041 becomes 0x41, which is 01000001 in binary.

All other characters are represented by several bytes. U+0080 through U+07FF use two bytes each, U+0800 through U+FFFF use three bytes each, and U+10000 through U+10FFFF use four bytes each.
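
These ranges are easy to check with any UTF-8 implementation. As a quick illustration (a Python sketch of my own, not part of the original answer), the following prints the encoded bytes and byte count for one code point from each range:

    # One code point from each UTF-8 length class, encoded with
    # Python's built-in UTF-8 codec.
    for cp in (0x0041, 0x0754, 0x20AC, 0x1F600):
        encoded = chr(cp).encode("utf-8")
        print(f"U+{cp:04X} -> {encoded.hex()} ({len(encoded)} byte(s))")

    # Output:
    # U+0041 -> 41 (1 byte(s))
    # U+0754 -> dd94 (2 byte(s))
    # U+20AC -> e282ac (3 byte(s))
    # U+1F600 -> f09f9880 (4 byte(s))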

Computers know where one character ends and the next one begins because UTF-8 was designed so that there is no ambiguity: the byte values 0x00 through 0x7F are used only for ASCII characters and for nothing else, while every byte above 0x7F appears only inside multi-byte sequences. Furthermore, the bytes that begin a multi-byte sequence can never occur in any other position within a sequence.

For this to work, the code points have to be encoded into specific bit patterns. Consider the following:

  • 2 bytes: 110xxxxx 10xxxxxx
  • 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  • 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The number of leading 1s in the first byte tells you how many of the following bytes still belong to the same character. All continuation bytes of a sequence start with 10 in binary. To encode a character, you convert its code point to binary and fill in the x's.

As an example: U+0754 lies between U+0080 and U+07FF, so it needs two bytes. 0x0754 in binary is 11101010100, so you replace the x's with those digits:

11011101 10010100
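
If you want to reproduce that bit-filling programmatically, here is a small Python sketch of my own (not from the answer) that builds the two bytes by hand and checks them against the built-in codec:

    # Hand-rolled 2-byte UTF-8 encoding for a code point in the range
    # U+0080..U+07FF, following the pattern 110xxxxx 10xxxxxx above.
    cp = 0x0754                       # 11101010100 in binary (11 bits)
    byte1 = 0b11000000 | (cp >> 6)    # 110 + top 5 payload bits = 0xDD
    byte2 = 0b10000000 | (cp & 0x3F)  # 10 + low 6 payload bits = 0x94
    encoded = bytes([byte1, byte2])
    print(encoded.hex())                       # dd94
    print(encoded == chr(cp).encode("utf-8"))  # True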

+27

In short, UTF-8 encoding works like this:

  • 1-byte characters (the ASCII range) start with the bit 0
  • the leading byte of a 2-byte character starts with two 1s followed by a 0 (i.e. 110)
  • the leading byte of a 3-byte character starts with three 1s followed by a 0 (i.e. 1110)
  • the leading byte of a 4-byte character starts with four 1s followed by a 0 (i.e. 11110)
  • all continuation bytes (the second and later bytes of a character) start with 1 followed by 0 (i.e. 10)

So your example with the Unicode code points U+0041 and U+0754 is encoded in UTF-8 as:

01000001 11011101 10010100

When decoding this UTF-8 stream, the first byte starts with 0, so it is a 1-byte character; the second byte starts with 110, so it is the first byte of a 2-byte character; and the third byte starts with 10, so it is the continuation byte of that 2-byte character.
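
To make that decoding rule concrete, here is a minimal Python sketch (my own illustration, not code from the answer) that classifies each byte by its leading bits exactly as described above; a real decoder would additionally have to reject malformed input such as overlong sequences:

    def utf8_decode(data: bytes) -> str:
        # Walk the byte stream, using the leading bits of each byte to
        # decide whether it starts a 1/2/3/4-byte character.
        chars = []
        i = 0
        while i < len(data):
            b = data[i]
            if b >> 7 == 0b0:           # 0xxxxxxx: single byte (ASCII)
                cp, extra = b, 0
            elif b >> 5 == 0b110:       # 110xxxxx: leads a 2-byte char
                cp, extra = b & 0b00011111, 1
            elif b >> 4 == 0b1110:      # 1110xxxx: leads a 3-byte char
                cp, extra = b & 0b00001111, 2
            elif b >> 3 == 0b11110:     # 11110xxx: leads a 4-byte char
                cp, extra = b & 0b00000111, 3
            else:                       # 10xxxxxx: stray continuation
                raise ValueError(f"unexpected continuation byte at {i}")
            for j in range(1, extra + 1):   # 10xxxxxx: take 6 bits each
                cp = (cp << 6) | (data[i + j] & 0b00111111)
            chars.append(chr(cp))
            i += extra + 1
        return "".join(chars)

    print(utf8_decode(bytes([0b01000001, 0b11011101, 0b10010100])))  # Aݔ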


For more details, see the Wikipedia articles on UTF-8 and Unicode.

+15

To add to the other answers: note that UTF-8 is backward compatible with 7-bit ASCII, not with the 8-bit "extended ASCII" code pages, which are a different thing entirely.

As a consequence, characters that occupy a single byte in an extended-ASCII code page (the range 0x80 to 0xFF) occupy two bytes in UTF-8, and characters in the range 0x0800 to 0xFFFF occupy three bytes.

Unicode code points run from 0 to 1,114,111 (0x10FFFF), even though three fixed bytes could address 16,777,215 (0xFFFFFF) values; the variable-length scheme trades that raw range for ASCII compatibility and self-synchronization.


, "" NUL (0), .

Hope this helps!

+1

Source: https://habr.com/ru/post/1679337/

