UTF-8 handling in C

I have a basic understanding of UTF-8: code points are encoded with a variable number of bytes, so a single "character" can take 8 bits, 16 bits, or even more.

I am wondering if there is example code, a library, etc. in C that does for UTF-8 strings what the standard C library does for ordinary strings, for example, returning the length of the string, and so on.

Thanks,

+6
2 answers

GNU has a Unicode string library called libunistring, but it does not handle anywhere near as much as ICU does.

For example, the GNU library does not even give you access to collation, which is the basis of all string comparison; ICU does. Another thing ICU has that the GNU library does not appear to is Unicode regular expressions. For that, you can use Philip Hazel's excellent PCRE library for C, which can be compiled with UTF-8 support.
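
For illustration, here is a minimal sketch using PCRE2, the current incarnation of that library; the pattern and subject strings are just illustrative assumptions, and it only works if PCRE2 was built with Unicode support.

```c
/* Minimal sketch: matching a UTF-8 pattern with PCRE2.
   Assumes PCRE2 was built with Unicode support; compile with -lpcre2-8. */
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>

int main(void) {
    PCRE2_SPTR pattern = (PCRE2_SPTR)"\\p{L}+";      /* one or more Unicode letters */
    PCRE2_SPTR subject = (PCRE2_SPTR)"h\xC3\xA9llo"; /* "héllo" encoded as UTF-8 */
    int errcode;
    PCRE2_SIZE erroffset;

    /* PCRE2_UTF makes the pattern and subject be treated as UTF-8;
       PCRE2_UCP makes \w, \d, etc. use Unicode properties. */
    pcre2_code *re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED,
                                   PCRE2_UTF | PCRE2_UCP,
                                   &errcode, &erroffset, NULL);
    if (re == NULL) {
        fprintf(stderr, "compile failed at offset %zu\n", (size_t)erroffset);
        return 1;
    }

    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    int rc = pcre2_match(re, subject, PCRE2_ZERO_TERMINATED, 0, 0, md, NULL);
    printf(rc > 0 ? "matched\n" : "no match\n");

    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return 0;
}
```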

However, perhaps the GNU library is enough for what you need. I do not like its API, though; it is a mess. If you like programming in C, you might also try the Go programming language, which has excellent Unicode support. It is a new language, but small, clean, and fun to use.
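
If libunistring does turn out to be enough for the original question (getting the length of a UTF-8 string), here is a minimal sketch using the u8_* functions from its unistr.h header; the example string is just an assumption.

```c
/* Minimal sketch: byte length vs. code point count with GNU libunistring.
   Compile with -lunistring; function names are as documented in unistr.h. */
#include <stdio.h>
#include <stdint.h>
#include <unistr.h>

int main(void) {
    /* "naïve" encoded as UTF-8: the ï takes two bytes. */
    const uint8_t *s = (const uint8_t *)"na\xC3\xAFve";

    size_t bytes = u8_strlen(s);           /* number of bytes, like strlen */

    if (u8_check(s, bytes) != NULL) {      /* NULL means well-formed UTF-8 */
        fprintf(stderr, "invalid UTF-8\n");
        return 1;
    }

    size_t chars = u8_mbsnlen(s, bytes);   /* number of code points */
    printf("%zu bytes, %zu code points\n", bytes, chars);
    return 0;
}
```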

On the other hand, the major interpreted languages (Perl, Python, and Ruby) all have some Unicode support, better than anything you will get in C. Of these, Perl's Unicode support is the most developed and reliable.

Remember: it is not enough just to support more characters. Without the rules that go with them, you do not have Unicode. At best, you have ISO 10646: a large repertoire of characters, but no rules. My mantra is: Unicode is not just more characters, it is more characters plus a whole bunch of rules for processing them.

+4

The most advanced library for Unicode processing is IBM's ICU (International Components for Unicode).
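
As a rough sketch of what that looks like from C, ICU4C ships UTF-8 iteration macros in <unicode/utf8.h>; the example below just walks a string and counts code points (the string literal is an illustrative assumption, and you may need to link against libicuuc depending on your ICU version).

```c
/* Minimal sketch: counting code points with ICU4C's U8_NEXT macro. */
#include <stdio.h>
#include <string.h>
#include <unicode/utf8.h>

int main(void) {
    const char *s = "gr\xC3\xBC\xC3\x9F dich";   /* "grüß dich" encoded as UTF-8 */
    int32_t i = 0, length = (int32_t)strlen(s);
    size_t count = 0;

    while (i < length) {
        UChar32 c;
        U8_NEXT(s, i, length, c);   /* advances i past one code point, stores it in c */
        if (c < 0) {                /* a negative value signals an ill-formed sequence */
            fprintf(stderr, "invalid UTF-8 at byte %d\n", (int)i);
            return 1;
        }
        count++;
    }
    printf("%zu code points in %d bytes\n", count, (int)length);
    return 0;
}
```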

But if you just need to determine the number of code points in a UTF-8 encoded string, count the bytes with values between \x01 and \x7F or between \xC2 and \xFF; the remaining bytes (\x80 to \xBF) are continuation bytes and do not start a new code point.
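
A minimal sketch of that counting trick, assuming the input is already well-formed UTF-8 (it skips the continuation bytes and counts everything else):

```c
/* Minimal sketch: code point count for a well-formed UTF-8 string, no libraries. */
#include <stdio.h>
#include <string.h>

static size_t utf8_codepoint_count(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        /* Continuation bytes look like 10xxxxxx; everything else starts a code point. */
        if ((c & 0xC0) != 0x80)
            count++;
    }
    return count;
}

int main(void) {
    const char *s = "h\xC3\xA9llo";   /* "héllo": the é takes two bytes */
    printf("%zu bytes, %zu code points\n", strlen(s), utf8_codepoint_count(s));
    return 0;
}
```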

+1

Source: https://habr.com/ru/post/917623/

