How to get Unicode code points for character strings (UTF-8) in C or C++ (Linux)

Question

I am working on an application in which I need to know the Unicode code points of characters in order to classify them as Chinese characters, Japanese characters (kanji, katakana, hiragana), Latin, Greek, etc.

The input strings are in UTF-8 format.

Is there any way to find out the Unicode code point of a UTF-8 character? For instance:

The character '≠' has the Unicode value U+2260.
The character '建' has the Unicode value U+5EFA.

c++ c unicode utf-8

Ashish yadav Mar 25 '11 at 7:08

2 answers

libiconv can help you convert a UTF-8 string to UTF-16 or UTF-32. UTF-32 would be the most reliable option if you really want to support all possible Unicode code points, because each code point is stored as exactly one 32-bit unit.
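A minimal sketch of that approach, assuming the standard iconv_open / iconv / iconv_close interface (part of glibc on Linux; a standalone libiconv may need -liconv at link time). The function name utf8_to_code_points is my own and not part of any library. It asks for "UTF-32LE" so that no byte-order mark is written, then assembles each code point from four bytes so the result does not depend on host endianness.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a library function): converts a NUL-terminated
 * UTF-8 string into Unicode code points using iconv.
 * Writes at most cap code points into out and returns how many were written,
 * or (size_t)-1 on error (invalid UTF-8, out too small, or no converter). */
size_t utf8_to_code_points(const char *utf8, uint32_t *out, size_t cap)
{
    assert(utf8 != NULL && out != NULL);
    if (cap == 0)
        return 0;

    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    size_t out_size = cap * 4;                      /* 4 bytes per code point */
    unsigned char *bytes = malloc(out_size);
    if (bytes == NULL) {
        iconv_close(cd);
        return (size_t)-1;
    }

    /* iconv takes char ** for historical reasons; it does not modify the input. */
    char *in = (char *)utf8;
    size_t in_left = strlen(utf8);
    char *outp = (char *)bytes;
    size_t out_left = out_size;

    size_t rc = iconv(cd, &in, &in_left, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(bytes);
        return (size_t)-1;
    }

    /* Assemble each little-endian 4-byte group into one code point. */
    size_t count = (out_size - out_left) / 4;
    for (size_t i = 0; i < count; i++) {
        out[i] = (uint32_t)bytes[4 * i]
               | ((uint32_t)bytes[4 * i + 1] << 8)
               | ((uint32_t)bytes[4 * i + 2] << 16)
               | ((uint32_t)bytes[4 * i + 3] << 24);
    }
    free(bytes);
    return count;
}
```

Called on the UTF-8 bytes of "≠建" with room for a few code points, this should fill the array with 0x2260 and 0x5EFA, which you can then compare against the Unicode block ranges you care about.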

Eelke Mar 25 '11 at 7:45

Sylvain defresne · Accepted Answer · 2011-03-25T07:44:26+0000

UTF-8 is a variable-width encoding for Unicode. Each Unicode code point is encoded as one to four char.

To decode a char* string and extract one code point, you read one byte. If its most significant bit is clear, the byte itself is the code point (the ASCII range). If it is set, the code point is encoded with several char: the number of consecutive set bits, counting from the most significant bit, indicates how many char are used to encode the code point.
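The lead-byte rule can be sketched in C like this. The name utf8_sequence_length is hypothetical, chosen only for illustration, and the function looks at the first byte only.

```c
#include <stddef.h>

/* Hypothetical helper: given the first byte of a UTF-8 sequence, returns
 * how many bytes (1 to 4) the whole sequence occupies, or 0 if the byte
 * cannot start a sequence. */
size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;   /* 0xxxxxxx: plain ASCII      */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx: two bytes        */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx: three bytes      */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx: four bytes       */
    return 0;   /* 10xxxxxx is a continuation byte; 11111xxx is invalid */
}
```

Remember to go through unsigned char when reading from a char* string, since plain char may be signed and bytes of 0x80 and above would otherwise become negative.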

This table explains how to do the conversion:

 UTF-8 (char*)                       | Unicode (21 bits)
-------------------------------------+--------------------------
 0xxxxxxx                            | 00000000000000000xxxxxxx
-------------------------------------+--------------------------
 110yyyyy 10xxxxxx                   | 0000000000000yyyyyxxxxxx
-------------------------------------+--------------------------
 1110zzzz 10yyyyyy 10xxxxxx          | 00000000zzzzyyyyyyxxxxxx
-------------------------------------+--------------------------
 11110www 10zzzzzz 10yyyyyy 10xxxxxx | 000wwwzzzzzzyyyyyyxxxxxx
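
As a worked example of the three-byte row (my own illustration, using the character from the question): '建' is stored in UTF-8 as the bytes E5 BB BA.

 bytes         E5       BB       BA
 binary        11100101 10111011 10111010
 payload bits      0101   111011   111010    (zzzz yyyyyy xxxxxx)
 joined        0101 1110 1111 1010  =  0x5EFA  =  U+5EFA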

Based on this, the code is simple enough to write. If you do not want to write it yourself, you can use a library that performs the conversion for you. There are many available under Linux: libiconv, icu, glib, ...
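
If you do write it by hand, a decoder following the table might look roughly like the sketch below. The name utf8_decode and its return convention are my own assumptions, not taken from any of the libraries above. Beyond the table, it also rejects overlong encodings, UTF-16 surrogates and values above U+10FFFF, which a strict decoder should not accept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from the first bytes of s,
 * which holds n bytes. On success stores the code point in *cp and returns
 * the number of bytes consumed (1 to 4). Returns 0 if the input is empty,
 * truncated, or not valid UTF-8. */
size_t utf8_decode(const char *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    /* Work on unsigned bytes: plain char may be signed. */
    const unsigned char *p = (const unsigned char *)s;
    size_t len;
    uint32_t value;
    uint32_t min;   /* smallest code point allowed for this length */

    if (p[0] < 0x80)                { len = 1; value = p[0];        min = 0x0;     }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; value = p[0] & 0x1F; min = 0x80;    }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; value = p[0] & 0x0F; min = 0x800;   }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; value = p[0] & 0x07; min = 0x10000; }
    else return 0;                  /* continuation byte or invalid lead byte */

    if (len > n)
        return 0;                   /* sequence is cut off */

    /* Each continuation byte must look like 10xxxxxx and adds 6 bits. */
    for (size_t i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        value = (value << 6) | (p[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (value < min || value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        return 0;

    *cp = value;
    return len;
}
```

To walk a whole string, call it in a loop and advance the pointer by the returned length each time, stopping when it returns 0. For the examples in the question, the bytes E2 89 A0 decode to 0x2260 and E5 BB BA decode to 0x5EFA.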