How to get Unicode code points from UTF-8 character strings in C or C++ (Linux)

I am working on an application in which I need to know the Unicode code points of characters in order to classify them as Chinese, Japanese (kanji, katakana, hiragana), Latin, Greek, etc.

The string is in UTF-8 format.

Is there any way to find the Unicode code point of a UTF-8 character? For instance:

  • The character '≠' has the Unicode value U+2260.
  • The character '建' has the Unicode value U+5EFA.
2 answers

UTF-8 is a variable-width encoding of Unicode. Each Unicode code point is encoded as one to four char (bytes).

To decode a char* string and extract one code point, read one byte. If its most significant bit is clear, the byte is itself the code point (plain ASCII). Otherwise the code point is encoded with several bytes, and the number of consecutive set bits, counted from the most significant bit, tells you how many char are used to encode it.

This table explains how to do the conversion:

 UTF-8 (char*)                       | Unicode (21 bits)
 ------------------------------------+--------------------------
 0xxxxxxx                            | 00000000000000000xxxxxxx
 ------------------------------------+--------------------------
 110yyyyy 10xxxxxx                   | 0000000000000yyyyyxxxxxx
 ------------------------------------+--------------------------
 1110zzzz 10yyyyyy 10xxxxxx          | 00000000zzzzyyyyyyxxxxxx
 ------------------------------------+--------------------------
 11110www 10zzzzzz 10yyyyyy 10xxxxxx | 000wwwzzzzzzyyyyyyxxxxxx

Based on this, the code is simple enough to write. If you do not want to write it yourself, you can use a library that performs the conversion for you. Several are available on Linux: libiconv, ICU, GLib, ...
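
For illustration, here is a minimal sketch of a hand-written decoder that follows the bit layout in the table. The function name utf8_decode and its return convention (bytes consumed, 0 on error) are my own choices, not part of any library. It also rejects overlong forms, surrogates, and values above U+10FFFF, which the table alone does not cover.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the first bytes of s
 * (at most n bytes available). On success, stores the code point in *cp
 * and returns the number of bytes consumed (1 to 4). Returns 0 if the
 * sequence is truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs 1, 2, 3, 4 bytes. */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;

    /* The lead byte gives the sequence length and the top payload bits. */
    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;                   /* truncated sequence */

    /* Each continuation byte must look like 10xxxxxx and adds 6 bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, UTF-16 surrogates, and out-of-range values. */
    if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

To walk a whole string, call it in a loop and advance the pointer by the returned length each time. With the bytes E2 89 A0 it yields 0x2260, and with E5 BB BA it yields 0x5EFA, matching the two examples in the question.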


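libiconv can convert a UTF-8 string to UTF-16 or UTF-32. UTF-32 is the most straightforward option if you want to support all possible Unicode code points, because each 32-bit unit holds exactly one code point, whereas UTF-16 needs surrogate pairs for code points above U+FFFF.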
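
A rough sketch of that approach, using the standard iconv_open / iconv / iconv_close calls. The wrapper name utf8_to_code_points is made up for this example, and I am assuming the iconv implementation accepts the encoding name "UTF-32LE" (glibc and GNU libiconv do). The fixed little-endian output is reassembled byte by byte, so the code does not depend on the host byte order.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper: convert a NUL-terminated UTF-8 string into an
 * array of code points. Returns the number of code points written, or
 * (size_t)-1 if the input is invalid or out is too small. */
size_t utf8_to_code_points(const char *utf8, uint32_t *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in = (char *)utf8;        /* iconv() takes char **, input is not modified */
    size_t in_left = strlen(utf8);
    char *dst = (char *)out;        /* write the UTF-32LE bytes straight into out */
    size_t dst_left = out_cap * sizeof *out;

    size_t rc = iconv(cd, &in, &in_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    /* Turn each group of 4 little-endian bytes into a host-order value. */
    size_t count = out_cap - dst_left / sizeof *out;
    const unsigned char *bytes = (const unsigned char *)out;
    for (size_t i = 0; i < count; i++) {
        const unsigned char *b = bytes + 4 * i;
        out[i] = (uint32_t)b[0] | (uint32_t)b[1] << 8 |
                 (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
    }
    return count;
}
```

On Linux with glibc, iconv is part of the C library, so no extra link flag should be needed; with a separate libiconv you would typically link with -liconv.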


Source: https://habr.com/ru/post/1345200/

