Why is the degree symbol different in UTF-8 than in Unicode?

Question

According to http://www.utf8-chartable.de/ and http://www.fileformat.info/info/unicode/char/b0/index.htm, the degree symbol is B0 in Unicode but C2 B0 in UTF-8. How come?

+6

unicode utf-8

Muhammad Hewedy Jan 4 '12 at 18:33

4 answers

Unicode assigns this character the code point U+00B0, and in UTF-16 and UTF-32 that value appears directly as 0x00B0. UTF-8 cannot store a value higher than 127 (0x7F) in a single byte, since the most significant bit of each byte is reserved to indicate that the byte is part of a multi-byte sequence.

Basic 7-bit ASCII maps directly to the first 128 characters of UTF-8. Any character with a value higher than 127 decimal (7F hex) must be "escaped" by setting the high bit and using one or more additional bytes to encode it.
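A minimal Python sketch of the point above, using only the built-in `str.encode`. The variable names are just for illustration. It compares an ASCII character with the degree symbol and checks the high bit of each resulting byte.

```python
# An ASCII character encodes to a single byte with the high bit clear.
ascii_bytes = "A".encode("utf-8")

# The degree symbol (U+00B0) is above 127, so it needs two bytes,
# and both of them have the high bit set.
degree_bytes = "\u00b0".encode("utf-8")

ascii_high_bits = [bool(b & 0x80) for b in ascii_bytes]
degree_high_bits = [bool(b & 0x80) for b in degree_bytes]

print(len(ascii_bytes), ascii_high_bits)
print(len(degree_bytes), degree_high_bits)
```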

+4

Marc B Jan 4 '12 at 18:40

UTF-8 is one Unicode encoding. UTF-16 and UTF-32 are other Unicode encodings.

Unicode defines a numeric value for each character; the degree symbol has the value 0xB0, or 176 in decimal. Unicode by itself does not determine how these numeric values are represented in bytes.

UTF-8 encodes the value 0xB0 as two consecutive octets (bytes) with the values 0xC2 0xB0.

UTF-16 encodes the same value either as 0x00 0xB0 or as 0xB0 0x00, depending on endianness.

UTF-32 encodes it as 0x00 0x00 0x00 0xB0 or as 0xB0 0x00 0x00 0x00, again depending on endianness (I suppose other orderings are possible).
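These byte sequences can be checked with a short Python sketch. It relies only on the standard codec names `utf-8`, `utf-16-be`, `utf-16-le`, `utf-32-be` and `utf-32-le`; the explicit `-be`/`-le` variants are used so that no byte order mark is prepended. The dictionary name `encodings` is just for illustration.

```python
degree = "\u00b0"  # the degree symbol, code point U+00B0

# One code point, several byte representations.
encodings = {
    name: degree.encode(name).hex()
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}

print(hex(ord(degree)))
for name, hex_bytes in encodings.items():
    print(name, hex_bytes)
```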

+4

Keith Thompson Jan 4 '12 at 19:21

The answers from NPE, Marc and Keith are good, and they go beyond my own knowledge of this topic. However, I had to read them a couple of times before I understood what they meant. Then I saw this web page, which made it click for me.

At http://www.utf8-chartable.de/ you can see the following:

Note how you need to use two bytes to encode a single character. Now read the accepted answer from NPE.

+1

Tormod Mar 16 '14 at 7:17

NPE · Accepted Answer · 2012-01-04T18:39:15+0000

UTF-8 is a method of encoding Unicode characters using a variable number of bytes (the number of bytes depends on the code point).

Code points between U+0080 and U+07FF use the following two-byte encoding:

110xxxxx 10xxxxxx

where the x's represent the bits of the code point.

Consider U+00B0. In binary, 0xB0 is 10110000; padded to the 11 bits the pattern holds, that is 00010 110000. Substituting these bits into the above pattern gives:

  11000010 10110000

In hexadecimal, this is 0xC2 0xB0.
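The substitution can also be done by hand in code. Below is a minimal Python sketch of the two-byte case only; the function name `encode_two_byte` is made up for this illustration, and the result is compared against Python's built-in UTF-8 encoder.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF using the 110xxxxx 10xxxxxx pattern."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte range")
    # Lead byte: marker bits 110 followed by the top 5 of the 11 payload bits.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: marker bits 10 followed by the low 6 payload bits.
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


encoded = encode_two_byte(0xB0)
print(encoded.hex())
print(encoded == "\u00b0".encode("utf-8"))
```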
