Unicode C11 Support

I am writing some string conversion functions similar to atoi() or strtoll(). I wanted to include a version of my function that would accept char16_t * or char32_t *, not just char * or wchar_t *.

My function works fine, but while writing it I realized that I do not understand what char16_t and char32_t actually are. I know that the standard only requires them to be integer types of at least 16 and 32 bits respectively, but the implication is that they hold UTF-16 and UTF-32.

I also know that the standard defines a couple of functions, but it did not include any *get or *put functions (as it did when wchar.h was added in C99).

So I'm wondering: what do they expect me to do with char16_t and char32_t?

+6
3 answers

This is a good question with no obvious answer.

The types and functions in uchar.h added in C11 are largely useless. They only support conversions between the new types (char16_t or char32_t) and the locale-specific multibyte encoding, and those mappings will not be complete unless the locale is UTF-8 based. The useful conversions (to/from wchar_t and to/from UTF-8) are not supported. Of course, you can roll your own conversions to/from UTF-8, since these are 100% specified by the relevant RFC/UCS/Unicode standards, but be careful: most people implement them incorrectly and end up with dangerous bugs.
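To make the "roll your own, but carefully" point concrete, here is a minimal sketch of a strict UTF-8 decoder. The name utf8_decode and its signature are hypothetical (nothing like it exists in C11); the sketch rejects the inputs that naive decoders commonly let through: overlong forms, surrogates, and values above U+10FFFF.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper, not part of C11: decode one code point from the
   UTF-8 bytes s[0..n-1] into *out. Returns the number of bytes consumed,
   or 0 if the input is truncated or ill-formed. */
size_t utf8_decode(const unsigned char *s, size_t n, char32_t *out)
{
    size_t len, i;
    char32_t cp, min;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                    /* single-byte (ASCII) */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 2-byte lead */
        len = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 3-byte lead */
        len = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 4-byte lead */
        len = 4; cp = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                         /* stray continuation or invalid lead */
    }

    if (n < len)
        return 0;                         /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min)
        return 0;                         /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                         /* surrogates are not valid scalars */
    if (cp > 0x10FFFF)
        return 0;                         /* beyond the Unicode range */

    *out = cp;
    return len;
}
```

The overlong check matters most in practice: a decoder that accepts the two-byte form C0 80 as U+0000 can be used to smuggle characters past earlier validation.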

Note that the new compiler-level features for UTF-8, UTF-16, and UTF-32 string literals (the u8, u, and U prefixes, respectively) are potentially useful; you can process the resulting strings with your own functions in meaningful ways that are completely locale-independent. But the library-level Unicode support in C11 is, in my opinion, mostly useless.
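As an illustration of processing prefixed literals without touching the locale, here is a small sketch. The helpers c16len and c32len are hypothetical names, and the code assumes the compiler encodes u"..." as UTF-16 and U"..." as UTF-32 (signalled by the __STDC_UTF_16__ and __STDC_UTF_32__ macros).

```c
#include <stddef.h>
#include <uchar.h>

/* The same text in the three literal forms added by C11. */
static const char     text8[]  = u8"h\u00E9";  /* UTF-8 code units  */
static const char16_t text16[] = u"h\u00E9";   /* UTF-16 code units */
static const char32_t text32[] = U"h\u00E9";   /* UTF-32 code units */

/* Hypothetical helper: count char16_t code units before the terminator. */
size_t c16len(const char16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: count char32_t code units before the terminator.
   In UTF-32 this is also the number of code points. */
size_t c32len(const char32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

Under the stated assumption, a character outside the Basic Multilingual Plane such as U+1F600 takes two units in the u"..." form (a surrogate pair) and one unit in the U"..." form, which is why the two lengths differ for the same text.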

+9

Testing whether a UTF-16 or UTF-32 character in the ASCII range is one of the "usual" 10 digits, +, - or a "normal" white-space character is easy to do, as is converting '0'-'9' to a digit value. Given that, atoi_utf16/32() proceeds just like atoi(): simply inspect one character at a time.
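A minimal sketch of that ASCII-only approach for UTF-32 input follows. The name atoi_c32 is hypothetical; like atoi(), the sketch does no overflow checking, and it deliberately treats non-ASCII digits and spaces as ordinary characters that end the number.

```c
#include <uchar.h>

/* Hypothetical atoi() analogue for NUL-terminated UTF-32 input.
   Only ASCII white space, '+', '-' and '0'..'9' are recognized.
   Overflow is not checked (as with atoi(), behavior is then undefined). */
int atoi_c32(const char32_t *s)
{
    int negative = 0;
    int value = 0;

    /* Skip ASCII white space: space, \t, \n, \v, \f, \r. */
    while (*s == U' ' || (*s >= U'\t' && *s <= U'\r'))
        s++;

    /* Optional sign. */
    if (*s == U'+' || *s == U'-') {
        negative = (*s == U'-');
        s++;
    }

    /* Accumulate decimal digits one code point at a time. */
    while (*s >= U'0' && *s <= U'9') {
        value = value * 10 + (int)(*s - U'0');
        s++;
    }

    return negative ? -value : value;
}
```

For example, atoi_c32(U"  -42abc") yields -42, while a string that starts with an Arabic-Indic digit such as U+0663 yields 0 because that digit is outside the ASCII range.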

Testing whether some other UTF-16/UTF-32 character is a digit or white space is harder. The code would need an extended isspace() and isdigit(), which can be had by switching locales with setlocale(), if the needed locale is available. (Note: you will likely need to restore the locale when the function is done.)

Converting a character that passes isdigit() but is not one of the usual 10 digits to its value is problematic. In any case, that does not even appear to be allowed.

Conversion Steps:

  • Set the locale for UTF-16/UTF-32.

  • Use isspace() to detect white space.

  • Convert in a similar fashion for your_atof().

  • Restore the locale.
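The set/classify/restore pattern in the steps above can be sketched as follows. This is only a sketch under assumptions: the helper name c32_isspace_in is hypothetical, it uses iswspace() from wctype.h rather than isspace() because isspace() only accepts unsigned char values, and it assumes wchar_t can hold the Unicode code point (true where __STDC_ISO_10646__ is defined, not on platforms with a 16-bit wchar_t).

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <uchar.h>
#include <wctype.h>

/* Hypothetical helper: test whether c is white space under the named
   locale, restoring the previous LC_CTYPE locale afterwards.
   Returns 1 or 0, or -1 if the locale could not be set. */
int c32_isspace_in(const char *locale_name, char32_t c)
{
    const char *current;
    char *saved;
    int result = -1;

    /* Query the current locale and copy its name, because the string
       returned by setlocale() may be overwritten by a later call. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    /* Switch, classify, and switch back. */
    if (setlocale(LC_CTYPE, locale_name) != NULL) {
        result = (iswspace((wint_t)c) != 0);
        setlocale(LC_CTYPE, saved);
    }

    free(saved);
    return result;
}
```

Keep in mind that setlocale() changes process-wide state, so this pattern is not safe if other threads use locale-dependent functions at the same time.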

+3

This question may be a little old, but I would like to touch on the implementation of your functions with char16_t and char32_t support.

The easiest way to do this is to write your strtoull function using the char32_t type (call it something like strtoull_c32). This simplifies Unicode parsing because in UTF-32 every code point occupies exactly one fixed-width (four-byte) unit. Then write strtoull_c16 and strtoull_c8, which internally convert UTF-16 and UTF-8 to UTF-32 and pass the result to strtoull_c32.
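Here is a rough sketch of that layering for the UTF-16 case. Only the names strtoull_c32 and strtoull_c16 come from the answer; the signatures, the utf16_to_utf32 helper, the base-10-only parsing, and the fixed 64-unit buffer are all assumptions made to keep the example short.

```c
#include <stddef.h>
#include <uchar.h>

/* Simplified core parser: base 10 only, no sign, no overflow check. */
unsigned long long strtoull_c32(const char32_t *s, size_t n)
{
    unsigned long long value = 0;
    size_t i;
    for (i = 0; i < n && s[i] >= U'0' && s[i] <= U'9'; i++)
        value = value * 10 + (s[i] - U'0');
    return value;
}

/* Hypothetical helper: decode n UTF-16 units into at most cap UTF-32
   units. Returns the number of code points written, or (size_t)-1 on an
   unpaired surrogate or if the output buffer is too small. */
size_t utf16_to_utf32(const char16_t *in, size_t n, char32_t *out, size_t cap)
{
    size_t i, j = 0;
    for (i = 0; i < n; i++) {
        char32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF) {            /* high surrogate */
            if (i + 1 >= n || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                return (size_t)-1;                   /* missing low half */
            u = 0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++;                                     /* consumed two units */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return (size_t)-1;                       /* stray low surrogate */
        }
        if (j == cap)
            return (size_t)-1;                       /* output full */
        out[j++] = u;
    }
    return j;
}

/* UTF-16 front end: convert, then delegate to the UTF-32 parser.
   Returns 0 if the input is ill-formed or longer than the buffer. */
unsigned long long strtoull_c16(const char16_t *s, size_t n)
{
    char32_t buf[64];
    size_t m = utf16_to_utf32(s, n, buf, sizeof buf / sizeof buf[0]);
    if (m == (size_t)-1)
        return 0;
    return strtoull_c32(buf, m);
}
```

A real implementation would more likely decode one code point at a time while parsing instead of filling a temporary buffer, and would report errors through an end pointer and errno the way strtoull() does.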

I honestly have not looked at the Unicode facilities in the C11 standard library, but if they do not provide a suitable way to convert these types to UTF-32, you can use a third-party library to do the conversion for you.

There is ICU, which was started by IBM and later adopted by the Unicode Consortium. It is a very feature-rich and stable library that has been around for a long time.

I recently started a UTF library (UTFX) for C89 that you could use for this. It is fairly simple and lightweight, unit tested, and documented. You could give it a try, or use it to learn more about how UTF conversions work.

0

Source: https://habr.com/ru/post/976020/

