How does UTF-8 encoding identify single-byte and double-byte characters?

I recently ran into a character encoding problem while reading about character sets and encodings, and this question occurred to me. UTF-8 is the most popular encoding because of its backward compatibility with ASCII. Since UTF-8 is a variable-length encoding, how does it distinguish between single-byte and double-byte characters? For example, “A ݔ” is stored as “410754” (the Unicode code point for A is 41, and for the Arabic character it is 0754). How does the decoder identify that 41 is one single-byte character and 0754 is another, double-byte character? Why is it not read as 4107 being one double-byte character and 54 a single-byte character?

+16
3 answers

For example, "A ݔ" is saved as "410754"

This is not how UTF-8 works.

Characters U+0000 through U+007F (aka ASCII) are stored as single bytes. They are the only characters whose code points numerically match their UTF-8 representation. For example, U+0041 becomes 0x41, which is 01000001 in binary.

All other characters are represented by several bytes. U+0080 through U+07FF use two bytes each, U+0800 through U+FFFF use three bytes each, and U+10000 through U+10FFFF use four bytes each.
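
These ranges are easy to check with any UTF-8 implementation. As a quick illustration (a Python sketch of my own, not part of the original answer), the following prints the encoded bytes and byte count for one code point from each range:

    # One code point from each UTF-8 length class, encoded with
    # Python's built-in UTF-8 codec.
    for cp in (0x0041, 0x0754, 0x20AC, 0x1F600):
        encoded = chr(cp).encode("utf-8")
        print(f"U+{cp:04X} -> {encoded.hex()} ({len(encoded)} byte(s))")

    # Output:
    # U+0041 -> 41 (1 byte(s))
    # U+0754 -> dd94 (2 byte(s))
    # U+20AC -> e282ac (3 byte(s))
    # U+1F600 -> f09f9880 (4 byte(s))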

Computers know where one character ends and the next one begins because UTF-8 was designed so that there is no ambiguity: the byte values 0x00 through 0x7F are used only for ASCII characters and for nothing else, while every byte above 0x7F appears only inside multi-byte sequences. Furthermore, the bytes that begin a multi-byte sequence can never occur in any other position within a sequence.

For this to work, the code points have to be encoded into specific bit patterns. Consider the following:

  • 2 bytes: 110xxxxx 10xxxxxx
  • 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  • 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The number of leading 1s in the first byte tells you how many of the following bytes still belong to the same character. All continuation bytes of a sequence start with 10 in binary. To encode a character, you convert its code point to binary and fill in the x's.

As an example: U+0754 lies between U+0080 and U+07FF, so it needs two bytes. 0x0754 in binary is 11101010100, so you replace the x's with those digits:

11011101 10010100
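
If you want to reproduce that bit-filling programmatically, here is a small Python sketch of my own (not from the answer) that builds the two bytes by hand and checks them against the built-in codec:

    # Hand-rolled 2-byte UTF-8 encoding for a code point in the range
    # U+0080..U+07FF, following the pattern 110xxxxx 10xxxxxx above.
    cp = 0x0754                       # 11101010100 in binary (11 bits)
    byte1 = 0b11000000 | (cp >> 6)    # 110 + top 5 payload bits = 0xDD
    byte2 = 0b10000000 | (cp & 0x3F)  # 10 + low 6 payload bits = 0x94
    encoded = bytes([byte1, byte2])
    print(encoded.hex())                       # dd94
    print(encoded == chr(cp).encode("utf-8"))  # True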

+27

In short, UTF-8 encoding works like this:

  • 1-byte characters (the ASCII range) start with the bit 0
  • the leading byte of a 2-byte character starts with two 1s followed by a 0 (i.e. 110)
  • the leading byte of a 3-byte character starts with three 1s followed by a 0 (i.e. 1110)
  • the leading byte of a 4-byte character starts with four 1s followed by a 0 (i.e. 11110)
  • all continuation bytes (the second and later bytes of a character) start with 1 followed by 0 (i.e. 10)

So your example with the Unicode code points U+0041 and U+0754 is encoded in UTF-8 as:

01000001 11011101 10010100

When decoding this UTF-8 stream, the first byte starts with 0, so it is a 1-byte character; the second byte starts with 110, so it is the first byte of a 2-byte character; and the third byte starts with 10, so it is the continuation byte of that 2-byte character.
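
To make that decoding rule concrete, here is a minimal Python sketch (my own illustration, not code from the answer) that classifies each byte by its leading bits exactly as described above; a real decoder would additionally have to reject malformed input such as overlong sequences:

    def utf8_decode(data: bytes) -> str:
        # Walk the byte stream, using the leading bits of each byte to
        # decide whether it starts a 1/2/3/4-byte character.
        chars = []
        i = 0
        while i < len(data):
            b = data[i]
            if b >> 7 == 0b0:           # 0xxxxxxx: single byte (ASCII)
                cp, extra = b, 0
            elif b >> 5 == 0b110:       # 110xxxxx: leads a 2-byte char
                cp, extra = b & 0b00011111, 1
            elif b >> 4 == 0b1110:      # 1110xxxx: leads a 3-byte char
                cp, extra = b & 0b00001111, 2
            elif b >> 3 == 0b11110:     # 11110xxx: leads a 4-byte char
                cp, extra = b & 0b00000111, 3
            else:                       # 10xxxxxx: stray continuation
                raise ValueError(f"unexpected continuation byte at {i}")
            for j in range(1, extra + 1):   # 10xxxxxx: take 6 bits each
                cp = (cp << 6) | (data[i + j] & 0b00111111)
            chars.append(chr(cp))
            i += extra + 1
        return "".join(chars)

    print(utf8_decode(bytes([0b01000001, 0b11011101, 0b10010100])))  # Aݔ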


For more details, see the Wikipedia articles on UTF-8 and Unicode.

+15

To add to the other answers: note that UTF-8 is backward compatible with 7-bit ASCII, not with the 8-bit "extended ASCII" code pages, which are a different thing entirely.

As a consequence, characters that occupy a single byte in an extended-ASCII code page (the range 0x80 to 0xFF) occupy two bytes in UTF-8, and characters in the range 0x0800 to 0xFFFF occupy three bytes.

Unicode code points run from 0 to 1,114,111 (0x10FFFF), even though three fixed bytes could address 16,777,215 (0xFFFFFF) values; the variable-length scheme trades that raw range for ASCII compatibility and self-synchronization.


, "" NUL (0), .

Hope this helps!

+1

Source: https://habr.com/ru/post/1679337/

