ANSI C UTF-8 Problem

First, some context: I am developing a platform-independent library in ANSI C (not C++, and without any non-standard libraries such as the MS CRT or glibc).

After some searching, I found that one of the best internationalization approaches in ANSI C is to use the UTF-8 encoding.

In UTF-8:

  • strlen(s): always counts the number of bytes.
  • mbstowcs(NULL, s, 0): counts the number of characters (in a UTF-8 locale).

But I run into problems when I want random access to the elements (characters) of a UTF-8 string.

ASCII encoded:

    char get_char(char *ascii_str, int n) {
        // It is very FAST.
        return ascii_str[n];
    }

UTF-16/32 encoded:

    wchar_t get_char(wchar_t *wstr, int n) {
        // It is very FAST.
        return wstr[n];
    }

And here is my problem with UTF-8 encoding:

    // What is the return type?
    // A UTF-8 character can be 8, 16, 24, or 32 bits long.
    /*?*/ get_char(char *utf8str, int n) {
        // I can find the Nth character of the string with a for loop,
        // but that is too slow. What is the best way?
    }
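For clarity, the for-loop scan I mean looks something like this (assuming well-formed UTF-8 and that n is in range):

    #include <stddef.h>

    // Return a pointer to the start of the Nth code point by skipping
    // continuation bytes (those of the form 10xxxxxx). O(n) per lookup.
    const char *utf8_nth(const char *s, size_t n) {
        while (n > 0 && *s) {
            s++;
            while ((*s & 0xC0) == 0x80) s++;  /* skip continuation bytes */
            n--;
        }
        return s;
    }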

Thanks.

+6
4 answers

Perhaps you are thinking about this slightly wrongly. UTF-8 is an encoding that is useful for serializing data, e.g. writing it to a file or to the network. However, it is a non-trivial, variable-width encoding: a single Unicode code point can end up as anywhere from one to four encoded bytes.

What you should probably do, if you want to process text (given your description), is to store raw, fixed-width strings internally. If you are going to use Unicode (and you should), you need 21 bits per code point, so the closest integral type is uint32_t. In short, store all your strings internally as arrays of uint32_t; then you can randomly access any code point in constant time.

Only encode to UTF-8 when writing to a file or to the console, and decode from UTF-8 when reading.
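A minimal sketch of this approach, with a hypothetical helper name, assuming well-formed input (no validation of malformed or truncated sequences):

    #include <stddef.h>
    #include <stdint.h>  /* C99; for strict C89, substitute your own 32-bit typedef */
    #include <stdlib.h>

    /* Decode a NUL-terminated UTF-8 string into a heap-allocated array of
       code points; the count is returned through *out_len. */
    uint32_t *utf8_decode(const char *s, size_t *out_len) {
        const unsigned char *p = (const unsigned char *)s;
        size_t i, n = 0;
        uint32_t *cps;

        for (i = 0; p[i]; i++)               /* count code points: every byte */
            if ((p[i] & 0xC0) != 0x80) n++;  /* except 10xxxxxx starts one    */

        cps = malloc(n * sizeof *cps);
        if (cps == NULL) return NULL;

        for (i = 0; *p; i++) {
            uint32_t cp;
            if (*p < 0x80) {                 /* 1 byte:  0xxxxxxx */
                cp = *p++;
            } else if (*p < 0xE0) {          /* 2 bytes: 110xxxxx 10xxxxxx */
                cp  = (uint32_t)(*p++ & 0x1F) << 6;
                cp |= *p++ & 0x3F;
            } else if (*p < 0xF0) {          /* 3 bytes */
                cp  = (uint32_t)(*p++ & 0x0F) << 12;
                cp |= (uint32_t)(*p++ & 0x3F) << 6;
                cp |= *p++ & 0x3F;
            } else {                         /* 4 bytes */
                cp  = (uint32_t)(*p++ & 0x07) << 18;
                cp |= (uint32_t)(*p++ & 0x3F) << 12;
                cp |= (uint32_t)(*p++ & 0x3F) << 6;
                cp |= *p++ & 0x3F;
            }
            cps[i] = cp;
        }
        *out_len = n;
        return cps;
    }

Random access on the resulting array is then the same O(1) indexing as in the ASCII case, and UTF-8 appears only at the I/O boundary.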

By the way, a Unicode code point is still far from a "character". The notion of a character is too high-level a concept to have a simple general mechanism. (For example, "a" + "grave accent": that is two code points, but how many characters?)

+7

You simply can't do it in constant time. If you need many such lookups, you can build an index into the UTF-8 string, or convert it to UTF-32 up front. UTF-32 is the better in-memory representation for performance, while UTF-8 is good on disk.
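A sketch of the index idea, with hypothetical helper names (again assuming well-formed UTF-8):

    #include <stddef.h>
    #include <stdlib.h>

    /* Build an array mapping code-point index -> byte offset. */
    size_t *utf8_build_index(const char *s, size_t *out_count) {
        const unsigned char *p = (const unsigned char *)s;
        size_t i, n = 0;
        size_t *idx;

        for (i = 0; p[i]; i++)
            if ((p[i] & 0xC0) != 0x80) n++;   /* lead bytes start code points */

        idx = malloc(n * sizeof *idx);
        if (idx == NULL) return NULL;

        n = 0;
        for (i = 0; p[i]; i++)
            if ((p[i] & 0xC0) != 0x80) idx[n++] = i;

        *out_count = n;
        return idx;
    }

    /* Pointer to the start of the Nth character: now O(1). */
    const char *utf8_char_at(const char *s, const size_t *idx, size_t n) {
        return s + idx[n];
    }

The full index costs one size_t per character; a sparser index (say, one entry per 16 characters) trades that memory for a short bounded scan.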

By the way, the code you provided for UTF-16 is also incorrect: you would need to take care of surrogate pairs.

+4

What exactly do you want to count? As Kerrek SB noted, you can have decomposed glyphs: "é" can be represented as one code point (LATIN SMALL LETTER E WITH ACUTE, U+00E9) or as two code points (LATIN SMALL LETTER E, U+0065, followed by COMBINING ACUTE ACCENT, U+0301). Unicode has composed and decomposed normalization forms.
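The difference is easy to see at the byte level; a minimal demonstration with the two forms written out as UTF-8 escapes:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *composed   = "\xC3\xA9";  /* U+00E9 encoded in UTF-8 */
        const char *decomposed = "e\xCC\x81"; /* U+0065 then U+0301 in UTF-8 */

        /* Both render as "é", but they differ bytewise and in length. */
        printf("composed:   %u bytes\n", (unsigned)strlen(composed));      /* 2 */
        printf("decomposed: %u bytes\n", (unsigned)strlen(decomposed));    /* 3 */
        printf("bytewise equal: %d\n", strcmp(composed, decomposed) == 0); /* 0 */
        return 0;
    }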

What you are probably interested in counting is not characters but grapheme clusters. You need a higher-level library to handle this, and also to deal with normalization forms, correct (language-dependent) collation, proper line breaking, proper case mapping (e.g. German ß → SS), bidi support, and so on. Real i18n is complicated.

+1

Unlike the others, I don't see much benefit in using UTF-32 instead of UTF-8: when processing text, grapheme clusters (or "user-perceived characters") are much more useful than Unicode characters (i.e. raw code points), so even UTF-32 has to be treated as a variable-length encoding.

If you do not want to use a dedicated library, I suggest using UTF-8 as the on-disk, endianness-agnostic representation, and modified UTF-8 (which differs from UTF-8 in encoding the null character as a two-byte sequence) as the in-memory representation compatible with ASCIIZ strings.
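A sketch of where modified UTF-8 deviates, as a hypothetical encoder (it implements only the NUL twist described above and does no range or surrogate validation):

    #include <stddef.h>
    #include <stdint.h>

    /* Encode one code point; returns the number of bytes written
       (buf must have room for 4). */
    size_t mutf8_encode(uint32_t cp, unsigned char *buf) {
        if (cp == 0) {                     /* the one difference from UTF-8: */
            buf[0] = 0xC0; buf[1] = 0x80;  /* U+0000 as an overlong pair, so */
            return 2;                      /* no byte of the string is 0x00  */
        }
        if (cp < 0x80) { buf[0] = (unsigned char)cp; return 1; }
        if (cp < 0x800) {
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp < 0x10000) {
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }

Since no encoded byte is ever 0x00, strlen, strcpy, and the rest of the ASCIIZ toolbox keep working on such strings.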

The information needed for breaking strings into grapheme clusters can be found in Unicode Standard Annex #29 ("Unicode Text Segmentation").

0

Source: https://habr.com/ru/post/891603/
