UTF-8 is a variable-width encoding for Unicode. Each Unicode code point is encoded in one to four char (bytes).
To decode a char* string and extract one code point, read the first byte. If its most significant bit is set, the code point is encoded in several bytes; otherwise the byte itself is the code point. The number of consecutive bits set, counting from the most significant bit, tells you how many char are used to encode the code point.
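As a rough sketch of that rule, the leading byte can be classified like this. The function name utf8_sequence_length is made up for this example, and returning 0 for a byte that cannot start a sequence is my own convention:

```c
#include <stddef.h>

/* Hypothetical helper: returns how many bytes the sequence starting with
   `lead` occupies, or 0 if `lead` cannot start a sequence (it is a
   continuation byte 10xxxxxx or has five or more leading 1 bits). */
static size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: plain ASCII     */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx: two-byte form   */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx: three-byte form */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx: four-byte form  */
    return 0;                            /* not a valid leading byte  */
}
```

Note the cast to unsigned char when calling it with a plain char: on platforms where char is signed, bytes above 0x7F would otherwise be negative.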
This table explains how to do the conversion:
UTF-8 (char*)                       | Unicode (21 bits)
------------------------------------+--------------------------
0xxxxxxx                            | 00000000000000000xxxxxxx
------------------------------------+--------------------------
110yyyyy 10xxxxxx                   | 0000000000000yyyyyxxxxxx
------------------------------------+--------------------------
1110zzzz 10yyyyyy 10xxxxxx          | 00000000zzzzyyyyyyxxxxxx
------------------------------------+--------------------------
11110www 10zzzzzz 10yyyyyy 10xxxxxx | 000wwwzzzzzzyyyyyyxxxxxx
Based on this, the code is simple enough to write. If you would rather not write it yourself, you can use a library that performs the conversion for you. Several are available on Linux: libiconv, ICU, GLib, ...
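If you do write it by hand, a minimal decoder following the bit layouts described above might look like the sketch below. The name utf8_decode and its return convention (bytes consumed, 0 on a malformed sequence) are my own choices, not taken from any library. It assumes a NUL-terminated string, and it does not reject overlong encodings, surrogates, or values above U+10FFFF, which a strict decoder should also check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from the NUL-terminated
   string `s` into `*cp`. Returns the number of bytes consumed, or 0 if
   the sequence is malformed or truncated. */
static size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint32_t value;

    /* The leading byte gives the length and the top payload bits. */
    if (p[0] < 0x80)                { len = 1; value = p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; value = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; value = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; value = p[0] & 0x07; }
    else return 0; /* continuation byte or invalid leading byte */

    /* Each continuation byte must be 10xxxxxx and adds 6 payload bits.
       A NUL terminator fails this check, so truncation is caught too. */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        value = (value << 6) | (p[i] & 0x3F);
    }

    *cp = value;
    return len;
}
```

To walk a whole string, call it in a loop and advance the pointer by the returned length, stopping when it returns 0 or when the decoded code point is 0.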