Why is the degree symbol encoded differently in UTF-8 than its Unicode code point?

According to: http://www.utf8-chartable.de/ and http://www.fileformat.info/info/unicode/char/b0/index.htm

In Unicode it is B0, but in UTF-8 it is C2 B0. How come?

+6
4 answers

UTF-8 is a method of encoding Unicode characters using a variable number of bytes (the number of bytes depends on the code point).

Code points between U+0080 and U+07FF use the following two-byte encoding:

110xxxxx 10xxxxxx 

where the x's represent the bits of the code point being encoded.

Consider U+00B0. In binary, 0xB0 is 10110000. Pad it on the left with zeros to fill the 11 x positions (00010110000), substitute those bits into the pattern above, and you get:

  11000010 10110000 

In hexadecimal, this is 0xC2 0xB0.
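
The substitution above can be reproduced with a few bit operations. This is only a minimal sketch, assuming Python 3; the helper name `encode_two_byte` is made up for illustration and deliberately handles nothing outside the U+0080 to U+07FF range.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF with the 110xxxxx 10xxxxxx pattern."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only U+0080..U+07FF use the two-byte form")
    # The first byte carries the top 5 of the 11 payload bits.
    first = 0b11000000 | (code_point >> 6)
    # The second byte carries the low 6 payload bits.
    second = 0b10000000 | (code_point & 0b00111111)
    return bytes([first, second])


encoded = encode_two_byte(0xB0)
print(" ".join(f"{b:02X}" for b in encoded))  # C2 B0
# Cross-check against Python's built-in UTF-8 encoder.
print(encoded == "\u00b0".encode("utf-8"))    # True
```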

+17

Unicode assigns this character the code point U+00B0, and UTF-16 and UTF-32 store that value directly (as 0x00B0). UTF-8 cannot store values higher than 127 (0x7F) in a single byte, since the most significant bit of each byte is reserved to indicate that the byte belongs to a multi-byte sequence.

Basic 7-bit ASCII maps directly to the first 128 characters of UTF-8. Any character with a value higher than 127 decimal (7F hex) must be "escaped" by setting the high bit and using one or more additional bytes to encode it.
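
The role of the high bit can be seen by inspecting the bytes directly. The following is a rough sketch, assuming Python 3 and valid UTF-8 input; `describe_byte` is a hypothetical helper written for this illustration, not part of any library.

```python
def describe_byte(b: int) -> str:
    """Classify one byte of valid UTF-8 by its most significant bits."""
    if b < 0x80:
        return "ASCII (high bit clear)"
    if (b & 0b11000000) == 0b10000000:
        return "continuation byte (10xxxxxx)"
    return "lead byte of a multi-byte sequence"


for ch in ("A", "\u00b0"):
    for b in ch.encode("utf-8"):
        print(f"U+{ord(ch):04X}  0x{b:02X}  {b:08b}  {describe_byte(b)}")
```

The letter A stays a single byte identical to its ASCII value, while the degree symbol yields a lead byte followed by a continuation byte, both with the high bit set.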

+4

UTF-8 is one Unicode encoding. UTF-16 and UTF-32 are other Unicode encodings.

Unicode defines a numeric value for each character; the degree symbol has the value 0xB0, or 176 in decimal. Unicode itself does not determine how these numeric values are represented as bytes.

UTF-8 encodes the value 0xB0 as two consecutive octets (bytes) with the values 0xC2 0xB0 .

UTF-16 encodes the same value either as 0x00 0xB0 or as 0xB0 0x00 , depending on the endianness.

UTF-32 encodes it as 0x00 0x00 0x00 0xB0 or as 0xB0 0x00 0x00 0x00 , again depending on the endianness (I assume other orderings are possible).
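
These byte sequences can be checked with Python's built-in codecs. This is a small sketch, assuming Python 3; the codec names with an explicit `-be` or `-le` suffix are used because the plain `utf-16` and `utf-32` codecs prepend a byte order mark when encoding.

```python
degree = "\u00b0"  # the degree symbol, code point U+00B0

# Map each codec name to the bytes it produces for the same code point.
encodings = {
    name: degree.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}

for name, data in encodings.items():
    print(f"{name:10}", " ".join(f"0x{b:02X}" for b in data))
```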

+4

The answers from NPE, Marc and Keith are good and go beyond my knowledge of this topic. However, I had to read them a couple of times before I understood what they meant. Then I saw this web page, which made it "click" for me.

At http://www.utf8-chartable.de/ you can see the following:

UTF-8 needs C2 80 to represent U+0080

Note how you need to use two bytes to encode a single character. Now read the accepted answer from NPE.

+1

Source: https://habr.com/ru/post/905125/

